Name: ZAIDI RAZAK
Designation: LECTURER
SYSTEM & COMPUTER TECHNOLOGY DEPARTMENT
FACULTY OF COMPUTER SCIENCE AND INFORMATION TECHNOLOGY
UNIVERSITY OF MALAYA
50603 KUALA LUMPUR
Date:
ABSTRACT
Automated speech recognition of Quranic verse recitation with Tajweed rule checking
capabilities is a new research area. The current manual method of teaching Al-Quran
reading skills has become less effective and less attractive, especially to the younger
Muslim generation. This method, known as the talaqqi and musyafahah method, is a
face-to-face learning process between students (recitors) and teachers (Mudarris), in
which listening, correction and repetition of the correct Al-Quran recitation take place in
real time. An automated speech recognition system with Tajweed rule checking
capability could be an alternative that supports the existing manual method of Quranic
learning, without denying the main role of the Mudarris in teaching Al-Quran. This
system is not intended to replace the Al-Quran, nor will it replace the role of teachers; it
complements the teaching process and helps ensure that the art of reciting Al-Quran is
not lost and forgotten. In this thesis, an automated Tajweed checking rules engine for
Quranic verse recitation was developed and tested, to offer Muslims an easier way to
recite and learn Al-Quran with a better understanding of Tajweed. The Mel-Frequency
Cepstral Coefficient (MFCC) feature extraction technique was used to extract
characteristics from Quranic verse recitation, and Hidden Markov Models (HMM) were
used for training and recognition. The most challenging task in this research was
implementing Al-Quran in a speech recognition system, together with the engine's
capability to check the Tajweed rules. Nevertheless, the engine achieved recognition
rates exceeding 91.95% (ayates) and 86.41% (phonemes), which indicates that the development of this
ABSTRAK
Tajweed, specifically for the recitation of the holy verses of Al-Quran, has been
developed, and it is a field that is still considered new. This research was carried out in
response to problems in the existing Al-Quran learning and teaching system and the
method currently in use, which is manual and involves the students and the teachers
(Mudarris) themselves. This method is believed to be less effective and less attractive to
implement, especially to the young Muslim generation. This learning approach is
adapted from one form of Al-Quran learning through Talaqqi and Musyafahah, known
as face-to-face learning between students and teachers (Mudarris). Through this method,
the whole process of
repeating the recitation fluently and with correct Tajweed takes place. An automated
system with the ability and capability to evaluate the Tajweed rules of Al-Quran
recitation is one alternative to support the existing, manual method of Al-Quran
learning, without neglecting or disputing the main role of the Mudarris in the teaching of
Al-Quran. The system developed is not
the role of the teacher; rather, its function is to complement the current learning process
and to ensure that the art of Al-Quran recitation is not eroded by time or easily
forgotten. In this thesis, an engine
was developed and its capability tested, through the introduction of a new method that is
easiest for the Muslim community to use, with a better understanding in learning
Al-Quran. The Mel-frequency Cepstral Coefficient (MFCC) feature extraction
technique was employed in this study, whereby the features and characteristics of the
recitation of the holy verses of Al-Quran were extracted, while Hidden Markov Model
(HMM) classification was used for training and recognition. The most challenging task
in carrying out this research was the implementation of the verses of Al-Quran in a
speech recognition system, together with its capability to check the Tajweed rules.
Nevertheless, the engine developed achieved a high recognition rate exceeding 91.95%
(verses) and 86.41% (words), which shows that the engine that has been
ACKNOWLEDGEMENTS
All praise is due to Allah, the Creator and Sustainer of this whole universe, the
Most Beneficent and the Most Merciful, for His guidance and blessing and for granting me
the Department of Computer System & Technology for providing the support to carry out this
research. I take great pride in forwarding my sincere appreciation and deepest gratitude to my
supervisor, Mr. Zaidi Razak, for his valuable guidance, support, encouragement and effort
throughout this research project. Without his tireless efforts, patience and guidance, this
research could not have been successfully completed. My special thanks are also dedicated to
my project leader, Prof. Dato' Dr. Mohd Yakub @ Zulkifli Bin Haji Mohd Yusoff, for his
I would also like to take this opportunity to thank the University of Malaya as the provider
of this scholarship, which supported me financially and funded my studies.
Last but not least, my most profound gratitude and respect go to my family, especially my
beloved parents, Haji Ibrahim Bin Husain and Hajjah Maimunah Muda, who have been the
I proudly dedicate this work to both of them; may Allah SWT bless them.
April 2010
University of Malaya,
Kuala Lumpur.
TABLE OF CONTENTS
Page
ABSTRACT ii
ABSTRAK iii
ACKNOWLEDGEMENTS v
LIST OF TABLES xv
CHAPTER 1: INTRODUCTION
1.1 Introduction 1
1.2 Background 2
1.3 Motivation 3
1.8 Terminology 6
1.8.1 Utterances 7
1.8.2 Vocabularies 7
1.8.3 Accuracy 7
1.9 Thesis Outline 8
2.1 Introduction 11
2.5.1 Pre-processing 23
Smoothing
Equalization
(MFCC)
Techniques
(a) HMM Training 31
2.5.4 Recognition/Identification 34
2.7 Summary 38
3.1 Introduction 39
3.2.2.1 Preemphasis 46
3.2.2.2 Framing 48
3.2.2.3 Windowing 49
3.2.3 Hidden Markov Model Classification 56
(a) Initialization 60
(c) Re-Estimation 69
(a) Initialization 75
3.3 Summary 79
4.1 Introduction 81
4.7 Summary 99
CHAPTER 5: EXPERIMENTAL RESULTS AND DISCUSSION
REFERENCES 128
APPENDIX A 134
APPENDIX B List of Published Papers and Achievements 139
LIST OF FIGURES
Page
Figure 3.13: The MFCC Cepstral Coefficients for ayates 'Maaliki yawmid diini' 55
Figure 3.15: The HMM sequence of training block diagram 60
Figure 3.16: The state transition probability matrix (A) for ayates 61
Figure 3.17: MATLAB code for initialize the model (mu, sigma) 62
Figure 3.19: The mean vectors mu (µ), for ayates ‘Maaliki yawmid diini’ 63
Figure 3.26 (a): Output score for the ayates ‘Maaliki yawmiddiini’ 78
‘Maaliki yawmiddiini’
engine
Figure 4.5: Tajweed Checking Rules Engine Data Flow Diagram (DFD) 88
Figure 4.6: Automated Tajweed checking rules engine for Quranic flow chart 89
Figure 4.7: Automated Tajweed Checking Rules Engine for Quranic verse 92
Figure 4.8: Load the wave file of input speech sample from sourate Al-Fatihah 93
Figure 4.11: The input speech sample and spectrogram graph for 'Bismillah' utterance 95
(1st mistake/notification)
Figure 4.13: The incorrect recitation part involved and Tajweed rules (2nd mistake/notification) 96
Figure 4.15: The incorrect recitation part involved and Tajweed rules 97
utterance
Figure 5.1: Percentage of accuracy for recognition rate (Ayates & Phonemes) 119
Figure 5.2: Percentage of Word Error Rate (WER) for ayates & Phonemes 120
LIST OF TABLES
Page
recognition techniques
Table 5.2: Summary of the Total Collected Speech Samples for each Ayates 102
Table 5.3: Template Data of HMM Model for Collected Quranic Recitations 104
Table 5.5: Result of Likelihood Ratio (LLR) for 8 recitations of speech 111
Table 5.7: Comparison between correct and incorrect Tajweed rules 114
Table 5.8: Comparison between correct and incorrect Tajweed rules 115
Table 5.9: Test result for 28 recitations of speech samples (Phonemes) 118
LIST OF ABBREVIATIONS
CN : Channel Normalization
FS : Sampling Frequency
Hz : Hertz
IV : In Vocabulary
J-QAF : Jawi, Quran, Arabic and Fardhu Ain (Islamic obligatory duty)
NN : Neural Network
OOV : Out of Vocabulary
PC : Personal Computer
VQ : Vector Quantization
CHAPTER 1
INTRODUCTION
1.1 Introduction
a great impact on our daily life. Furthermore, the problem of communication between
human beings and information technology has become critical. Until now, this
communication has been carried out mainly through keyboards and screens, but speech is
considered the most widely used and natural means of communication between humans,
and it is an obvious substitute for keyboards and screens in the communication process.
Moreover, the exchange of ideas among humans is carried out with the aid of
communication and has facilitated the development of technology in various forms.
Although speech applications in the computer interface area have grown drastically, the
capability to generate and interpret speech is still incomplete and imperfect. Investigations
in this research field have led to the development of automatic speech recognition systems.
with various components of Artificial Intelligence: natural language processing, speech
recognition technology and human-computer interaction fundamentals. This research is
concerned with speech recognition technology, which is part of speech and signal
processing technology.
1.2 Background
methodologies are essential in putting the word of God in its rightful place. The
development of Quranic lessons has successfully produced many Quranic scholars and at
the same time promoted the Quranic standard to a high priority level. The development of
ICT has also changed the world in many ways, in both positive and negative aspects.
Therefore, every Muslim must be able to identify appropriate and practical ways of
selecting the right type of information obtained from this new technology. Even though the
world has changed drastically, developments in Quranic studies have never become
outdated. The era of globalization and high technology also could not prevent academia in
Quranic studies from being influenced by current trends in technology.
believed that this recognition system is capable of educating students and
adults through an interactive learning system with Tajweed checking rules. Existing Al-Quran applications
are only capable of showing Al-Quran text and/or playing stored Al-Quran recitations, while this
system lets students recite Al-Quran through the system, and the recitation will be
It is believed that the Al-Quran learning process requires a special and effective way
to recite Al-Quran (Tabbal et al., 2006). Furthermore, the Al-Quran learning process is still
handled manually, based on Al-Quran reading skills through the talaqqi and
musyafahah methods. These methods are described as a face-to-face learning process
in which listening to the recitation and reciting the correct Al-Quran recitation again take place (Berita Harian, 2005).
This method is important to implement so that Muslims will know how the
hijaiyah letters are correctly pronounced. The process can only be done if the Mudarris and
the recitors follow the art, rules and regulations of reading Al-Quran, known as Tajweed.
1.3 Motivation
(i) Through this method, the Mudarris is required to check the Tajweed rules of their
students and to handle a large number of students per class. The
targeted objectives of j-QAF would be difficult to achieve, due to
(ii) A shortage of ICT applications in the teaching and learning process may bring a
(iii) The current busy lifestyle needs a modern and technological approach for a
self-learning method of reciting Al-Quran, which can improve the learning process
through a learning tool that is independently capable of evaluating the user's reading and
performance.
(i) To define the most suitable algorithm for feature extraction and recognition.
(ii) To determine the most accurate recognition process that suits Quranic
verse recitation.
(iii) To develop an engine that combines feature extraction and recognition, due to
The Tajweed checking rules engine only checks the basic rules of Tajweed and "Mad" in
This project is an entirely software-based system and does not involve any hardware
implementation. Thus, only MATLAB coding, simulation and GUI modeling are involved in
this research.
This automated Tajweed checking rules engine for Quranic verse recitation is
designed mainly to guide and assist the user, specifically Muslim users, when
reading Al-Quran. The aim of this system is to facilitate recitors during the Al-Quran
learning process, focusing on Quranic recitation based on the 'Rules of Tajweed'. This means that
the system is capable of checking the Tajweed rules against a stored database and of
recognizing the particular sourate in Al-Quran, which may be recited by recitors either
correctly or not, based on the Tajweed rule guidelines. This research is carried out in
different stages, as described below:
(ii) Extract the features from the collected Quranic recitation speech samples.
(iii) Train the feature vectors against the initial/available database, in order to
(iv) Recognize/match, as well as test, the unknown feature vectors against the
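For illustration, the staged workflow above can be sketched as a minimal recognition pipeline. The sketch below is plain Python (the thesis implementation is in MATLAB), and every function name and the toy energy-based "feature" are assumptions for illustration only, not the thesis's actual MFCC/HMM code.

```python
# Minimal sketch of the staged workflow: extract features from each
# recitation, train one model per ayat, then score an unknown sample
# against every trained model and pick the best match.
# All names and the toy feature are illustrative assumptions.

def extract_features(samples):
    # Placeholder for MFCC extraction: here we just use frame energies.
    frame = 4
    return [sum(x * x for x in samples[i:i + frame])
            for i in range(0, len(samples) - frame + 1, frame)]

def train_models(labelled_recordings):
    # One "model" per ayat: the mean feature vector of its recordings.
    models = {}
    for label, recordings in labelled_recordings.items():
        feats = [extract_features(r) for r in recordings]
        n = min(len(f) for f in feats)
        models[label] = [sum(f[i] for f in feats) / len(feats)
                         for i in range(n)]
    return models

def recognize(models, samples):
    # Match the unknown utterance against each model (squared distance).
    feats = extract_features(samples)
    def dist(label):
        model = models[label]
        n = min(len(model), len(feats))
        return sum((model[i] - feats[i]) ** 2 for i in range(n))
    return min(models, key=dist)
```

In a real system, `extract_features` would compute MFCC vectors and `train_models` would estimate HMM parameters per ayat; the control flow, however, mirrors stages (ii)-(iv) above.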
1.8 Terminology
The following definitions are the basics needed for understanding speech
recognition technology. These definitions can also act as
1.8.1 Utterances
to the computer. An utterance can be a single word, a few words, a sentence or even
multiple sentences, as long as it has a single meaning to the computer (Oxford
English Dictionary, 11th Edition). Here, the variability of Quranic Arabic
between Arabic countries, and even dialectal differences within the same country, causes
1.8.2 Vocabularies
speech recognition system. In fact, a small dictionary is easier for the computer to
recognize, while a large dictionary is more difficult. Moreover, the Arabic
language is morphologically rich, causing a high vocabulary growth rate. This
high growth rate is problematic for language models because it causes a large number of
out-of-vocabulary words.
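The out-of-vocabulary problem described above can be quantified as an OOV rate: the fraction of a test utterance's words missing from the recognizer's dictionary. A minimal sketch in Python (the example vocabulary and words are hypothetical):

```python
# Sketch: OOV rate = (words not in the dictionary) / (total words).
# A high vocabulary growth rate raises this fraction for a fixed dictionary.

def oov_rate(vocabulary, words):
    if not words:
        return 0.0
    missing = sum(1 for w in words if w not in vocabulary)
    return missing / len(words)
```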
1.8.3 Accuracy
The efficiency and ability of the system to recognize an utterance or a speaker can be
speech recognizers. It includes not only correctly identifying utterances, but also
identifying whether the spoken utterance is in the vocabulary or not. The acceptable
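Recognizer accuracy of this kind is commonly reported through the word error rate (WER): the edit distance between the reference and hypothesized word sequences, normalized by the reference length. A sketch in Python, assuming word sequences as lists (this is the standard metric, not necessarily the exact formula used in the thesis):

```python
# Sketch of word error rate (WER) via Levenshtein distance over words.
# Accuracy is then commonly reported as 1 - WER.

def word_error_rate(reference, hypothesis):
    r, h = reference, hypothesis
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / max(len(r), 1)
```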
1.9 Thesis Outline
This thesis contains six chapters, including this introductory chapter. Each chapter
is subject to certain scopes, which formulate the thesis contents. Below are the chapter
outlines.
Chapter 1: Introduction
Chapter 1 presents the definition and background of the project, including the
methodology, terminology and thesis outline, which define the scope of the research.
Chapter 2: Literature Review
Chapter 2 reviews works that are related and relevant to this research, in terms of the
techniques commonly used in such systems.
Chapter 3: Research Methodology
Chapter 3 describes the methodology used in this research. The sub-topics for this chapter include
Chapter 4 presents the automated Tajweed checking rules engine for Quranic verse
recitation. The sub-topics for this chapter include the research design of the engine and its
implementation, as well as other diagrams that represent the logical and physical designs
of the system.
information, analysis and discussion of the results obtained after the training
Appendix A:
Appendix B:
exhibitions.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
Speech recognition is one of the most important areas in digital signal processing.
In speech recognition, the research also involves the 'artificial intelligence' of
the system or machine itself, which may be able to 'hear' and 'understand' the spoken
information in a particular recitation. Automatic speech recognition has
reached a very high standard of performance over the past five years. Moreover, speech
The main area believed to contribute towards the effectiveness of this research project,
focused on speech recognition, is pattern recognition technology. However, speech
recognition has problems that belong to a much broader scientific topic called
pattern recognition or pattern matching. According to Huang et al. (2001), spoken language
processing relies on pattern recognition, which is one of the most challenging problems for
machines.
In this chapter, the general concepts related to the Quranic Arabic accent are
reviewed, and the significance that motivated this research is presented in
the subsequent chapters of this thesis. First, the art of Tajweed in Al-Quran is discussed
and presented. This provides a short description of the 'art', which is totally
different in terms of language in Arabic and is recognizably unique towards a set of
discussed. Experimental studies from the literature are presented, which show the gap and
the differences between written and recited Al-Quran. Next is a brief discussion of the effect
of the "Art of Tajweed" on the acoustic model, which can influence the recitation
recognition aspect of checking the Tajweed rules. These effects are related to the
Arabic linguistic properties, which are discussed elaborately in part 2.4.
Finally, some key works related to the research are highlighted, as well as algorithms and
techniques that are relevant to this research. Various types of feature extraction,
well as recitation at a moderate speed. It is a set of rules which governs how Al-Quran
should be read (Bashir, M.S. et al., 2003). It is considered an art because not all recitors
perform the same recitation of a Quranic verse in the same way (Tabbal, H. et al., 2006). The
"art of tajweed" is defined by a set of flexible, well-defined rules for reciting Al-Quran. Those
rules create a big difference between normal Arabic speech and recited Quranic
verses, which may produce interesting results based on the impact of "art" analysis on
the automatic speech recognition process, especially the acoustic model. Furthermore,
the "Art of Tajweed" as a manual method needs a lot of work and has proved to be
unable to adapt to new recitors. However, it is still believed that the special way to recite
Al-Quran is through the art of tajweed (Bashir, M.S. et al., 2003).
As we already know, each person's voice is different. Thus, the Al-Quran as
recited by different recitors will tend to differ a lot from one person to
another. Although Quranic sentences may be taken from the same verse, the way a
sentence of Al-Quran is recited or delivered may differ (Tabbal, H.
et al., 2006), producing different sounds for different recitors. Moreover,
many difficulties arise when dealing with the specialties of the Arabic language in
Al-Quran, regarding the differences between written and recited Al-Quran. The same
combinations of letters may be pronounced differently due to the use of harakattes (Tabbal,
H. et al., 2006). The most important Tajweed rules that are believed to influence the recitation
The above laws are based on specific recitation rules. Moreover, the
predefined "maqams" are also used by recitors to vary the tone of their recitations
(Tabbal, H. et al., 2006). There are 10 different laws set according to the 10 certified
scholars, namely Hafs, Kaloun, Warsh, Shu'bah, Hicham, Ibn-Dhakwan, Al-Duri, Al-Susi,
Al-Bazzi and Kunbul, who taught the recitation of the Holy Quran (Habash, M., 1998). In
order to deal with these laws, the prolongations, as the repetition of the vowel, n-
(Tabbal, H. et al., 2006). This rule governs the consonant/vowel combinations, the usage of short and long
vowels, and Ghonna rules, as well as rules for combining words (Ahmed, M.E., 1991). Note that if
any echoing sound is produced during the Quranic recitation recording process, the echo
will be considered noise. That noise can be eliminated using the noise-canceling filter
language of religious instruction in Islam, many more speakers have at least a passive
knowledge of the language. Arabic is one of the languages often described as
morphologically complex, and the problem of language modeling for Arabic is compounded
by dialectal variation (Vergyri, D. & Kirchhoff, K., 2004; Maamouri, M. et al., 2006;
Kirchhoff, K. et al., 2004). However, only Modern Standard Arabic (MSA) is used for
written and formal communication. This is because only MSA has a universally agreed-upon
writing standard for communication purposes (Vergyri, D. & Kirchhoff, K.,
2004; Maamouri, M. et al., 2006; Kirchhoff, K. et al., 2004; Kirchhoff, K., 2002).
As mentioned earlier in part 2.3, many difficulties arise when dealing
with the specialties of the Arabic language in Al-Quran, due to the differences between
written and recited Al-Quran (Tabbal, H. et al., 2006; Maamouri, M. et al., 2006; Kirchhoff,
K. et al., 2004). The Quranic Arabic alphabet consists of 28 letters, known as hijaiyah
letters, from alif (ا) until ya (ي) (Vergyri, D. & Kirchhoff, K., 2004; Kirchhoff, K. et al.,
2004). These comprise 25 letters that represent consonants and 3 letters for vowels
(/i:/, /a:/, /u:/) and the corresponding semivowels (/y/ and /w/), where applicable. A letter can
have two to four different shapes: isolated, beginning of a (sub)word, middle of a (sub)word
and end of a (sub)word (Kirchhoff, K. et al., 2004). Letters are mostly connected and
there is no capitalization. The letters are shown in table 2.1 in their various
forms.
Table 2.1: The Arabic alphabets (from Ramzi, A.H. & Omar, E.A., 2007)
Table 2.1: The Arabic alphabets (Continued)
Furthermore, other aspects of pronunciation are marked by diacritics, such as
consonant doubling (phonemic in Arabic), which is indicated by the "shadda" sign, and the
"tanween", word-final adverbial markers which add /n/ to the pronunciation (Maamouri, M.
et al., 2006; Kirchhoff, K., 2004), as shown below in table 2.2. These signs reflect
differences in pronunciation. Moreover, the diacritics are really important in setting up the
grammatical functions, leading to acceptable text understanding and correct
reading or analysis (Maamouri, M. et al., 2006). The entire set of diacritics is listed in table
2.2 below:
Table 2.2: Arabic diacritics (from Vergyri, D. & Kirchhoff, K., 2004)
Some Arabic letters may have an additional character called Hamza. Another non-basic
character is Taa-Marbuwta, which always appears at the end of a word. The Arabic language
has a very large vocabulary. Arabic characters may have diacritics, which are written as
strokes above or below the character and can change the pronunciation and meaning of
the word.
According to figure 2.1 shown above, each number represents a certain characteristic, as
listed below:
al., 2002)
number of back consonants. This type of consonant can cause a complex co-articulation
phenomenon in Arabic speech. Besides, a set of allophones as well as the consonant letters
(Ahmed, M.E., 1991; Youssef, A. & Emam, O., 2004) are also described, which had been
Group B: The Pharyngeals: /q/, /x/, and /γ/; and /r/.
Group C: The Madd letters: Alif "ا", Ya'a "ي", Waw "و".
Group D: The rest of the letters (except the pharyngealized Lam /L/).
Group F: Ash-Shamsi letters: /t/, /Ө/, /d/, /∂/, /z/, /s/, /∫/, /S/, /D/, /T/, /∂/, /l/, /n/.
Group G: Al-Qamari letters: /E/, /b/, /dz/, /H/, /x/, /? /, /γ/, /f/, /q/, /k/, /m/, /w/, /h/
Group I: Ikhfa’a letters: /t/, /Ө/, /s/, /∫/, /dz/, /d/, /∂/, /z/, /S/, /D/, /T/, /∂/, /f/, /k/, /q/.
Group J: Voiceless Fricative consonants: /f/, /Ө/, /s/, /∫/, /h/, /H/, /S/, /x/.
Letter-to-sound conversion for Arabic usually has a simple one-to-one mapping
between orthography and phonetic transcription, given correct diacritics. 14 vowels are
used to accommodate short and long vowels, as well as the emphatic vowels.
Each syllable begins with a consonant followed by a vowel, which makes syllables limited
and easily detectable. Short vowels are denoted by "V" and long vowels by "V:" (Ahmed,
M.E., 1991; Youssef, A. & Emam, O., 2004; Essa, O., 1998). Syllables can be
classified according to their length, also known as harakattes (Tabbal,
H. et al., 2006).
CV Short ; open
CV:C Long ; closed
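The C/V/V: syllable notation above can be classified mechanically. A small illustrative sketch in Python (the short/long, open/closed mapping follows the rows listed above; behavior for patterns beyond those rows is an assumption):

```python
# Sketch: classify an Arabic syllable pattern written in C / V / V:
# notation as (short|long, open|closed). "CV" is short and open;
# "CV:C" is long and closed, as in the table above.

def classify_syllable(pattern):
    closed = pattern.endswith("C")
    # A syllable counts as long if it has a long vowel (V:) or is closed.
    is_long = ("V:" in pattern) or closed
    return ("long" if is_long else "short",
            "closed" if closed else "open")
```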
According to the research, this project mainly focuses on basic speech
recognition technology, but it is implemented for a different type of application and
language, namely the Arabic of Al-Quran. Quranic Arabic recitation is best described as long,
slow-paced, rhythmic, monotone utterances (Kirchhoff, K. et al., 2003). The sound follows
a set of pronunciation rules, tajweed, designed for clear and accurate presentation of the text.
recitation recognition, which covered an Al-Quran verse delimitation system in audio files
using speech recognition techniques. Here, Quranic recitation and pronunciation, as well
as the software used for recognition purposes, were discussed. The Automatic Speech
Recognizer (ASR) was developed using the open-source Sphinx framework as the
basis of this research. The scope of this project focuses on the automated
delimiter, which can extract verses from audio files. Research techniques for each
phase were discussed and evaluated through the implementation of various techniques for
different recitors reciting sourate "Al-Ikhlas". Here, the most important Tajweed rules
and Tarteel, which may influence the recognition of a specific recitation, can be specified.
A comprehensive evaluation of Quranic verse recitation recognition techniques
was provided by Ahmad, A.M. et al. (2004). The survey provides recognition rates and
descriptions of test data for the approaches considered: Quranic Arabic recitation
recognition incorporating background on the area, discussion of the
techniques and potential research directions. Here, a Recurrent Neural Network with Backpropagation
Differences between the Arabic letters from alif (ا) until ya (ي) have been observed based on the
performance of cepstral analysis and recognition effectiveness. In general, there are five
major stages in a speech recognition system. Using the same speech recognition techniques,
Quranic Arabic recitation recognition can also be implemented based on
processing, pre-processing steps are essential. The main benefit of pre-processing in speech
recognition is to organize the information and simplify the subsequent task of recognition.
1. Endpoint Detection
parameters, with the augmentation of zero-crossing rate, pitch and duration information, in
endpoint detection algorithms. However, endpoint detection features have recently become
less reliable in the presence of non-stationary noise and various types of sound artifacts
(Shen, J. et al., 1998). This is because the endpoint detection and verification of speech
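A common endpoint-detection scheme of the kind described above combines short-time energy with the zero-crossing rate. A minimal Python sketch (the frame length and thresholds are illustrative assumptions, not values from the cited work):

```python
# Sketch of energy + zero-crossing-rate endpoint detection: a frame is
# marked as speech when its short-time energy or its zero-crossing rate
# exceeds a threshold; the first and last speech frames bound the word.

def detect_endpoints(signal, frame_len=80, energy_thr=0.1, zcr_thr=0.25):
    speech_frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        zcr = sum(1 for a, b in zip(frame, frame[1:])
                  if a * b < 0) / (frame_len - 1)
        if energy > energy_thr or zcr > zcr_thr:
            speech_frames.append(start)
    if not speech_frames:
        return None  # no speech detected
    return speech_frames[0], speech_frames[-1] + frame_len
```

As the cited work notes, such simple thresholds degrade under non-stationary noise, which is why later systems add pitch and duration cues.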
The purpose of the smoothing stage is to decrease the noise and regularize the word
contours. Ahmad, A.M. et al. (2004) also digitized the Arabic alphabet from speakers
and applied digital filtering. Digital filtering can emphasize the important frequency
components in the signal. The start and end points can then be analyzed based on the signal of the
phonemes. Here, GoldWave audio editor software was used to filter the input speech
signal from analog to digital signals, in order to analyze the start and end points that contain
the speech information.
According to Tabbal, H. et al. (2006), the use of a 2-stage pre-emphasis filter with
different factor values (0.92 and 0.97) could increase the recognition ratio of some audio
files, given the chosen speech frame of 10 ms and threshold of 10 dB for the speech extractor.
It can also be considered a noise-canceling filter, used to eliminate echo (noise).
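The pre-emphasis filter referred to here is the first-order difference y[n] = x[n] − a·x[n−1]. A Python sketch with the two factors (0.92 and 0.97) mentioned by Tabbal, H. et al. (2006) cascaded as two stages (the exact cascade arrangement is an assumption for illustration):

```python
# Sketch of first-order pre-emphasis: y[n] = x[n] - a * x[n-1].
# Boosting high frequencies flattens the speech spectrum before analysis.

def pre_emphasis(signal, factor=0.97):
    return [signal[0]] + [signal[n] - factor * signal[n - 1]
                          for n in range(1, len(signal))]

def two_stage_pre_emphasis(signal, factors=(0.92, 0.97)):
    # Two cascaded stages, as in the 2-stage filter described above.
    out = signal
    for a in factors:
        out = pre_emphasis(out, a)
    return out
```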
Besides pre-emphasis filtering, another technique was used by Kirchhoff, K. et al.:
models for each of the streams with different morphology, which is believed to outperform other
Another approach used for pre-processing is known as Channel Normalization (CN). Such
techniques have been developed for different application domains, where a particular
recognizer is trained with speech recorded using one microphone and recognition is
attempted on speech recorded with a different microphone. Here, the contribution
of channel normalization during training is still not known in detail, but it is still
2.5.2 Feature Extraction
characteristics from the speech signal that are unique, discriminative, robust and
computationally efficient for each word, which are then used to differentiate between different
words (Ursin, M., 2002). According to Martens, J.P. (2002), there are various speech
feature extraction techniques:
1. Linear Predictive Coding (LPC)
2. Perceptual Linear Prediction (PLP)
3. Mel-Frequency Cepstral Coefficients (MFCC)
4. Spectrographic Analysis
Ahmad, A.M. et al. (2004) used this type of extraction technique to extract the LPC
coefficients from the speech token. The coefficients are then converted to cepstral
coefficients that serve as input to the neural networks. A drawback of LPC is its
high sensitivity to quantization noise. Converting the LPC coefficients back into
cepstral coefficients decreases the sensitivity of the high- and low-order cepstral coefficients
to noise.
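The LPC-to-cepstrum conversion mentioned above is usually done with the standard recursion c_n = a_n + Σ_{k=1}^{n−1} (k/n)·c_k·a_{n−k}. A Python sketch (sign conventions vary with how the LPC polynomial is defined; this version is illustrative, not the cited authors' exact code):

```python
# Sketch of converting LPC prediction coefficients to cepstral
# coefficients with the standard recursion
#   c[n] = a[n] + sum_{k=1}^{n-1} (k/n) * c[k] * a[n-k].

def lpc_to_cepstrum(a, n_ceps=None):
    # a[0] corresponds to a_1, a[1] to a_2, ... (prediction coefficients)
    p = len(a)
    n_ceps = n_ceps or p
    c = [0.0] * (n_ceps + 1)  # c[0] unused; 1-based indexing below
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```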
According to Ahmed, M.E. (1991), the LPC model was replaced with a
formant model that has a much wider frequency spectrum. It is believed that the LPC synthetic
model can give a bad outcome for research that deduces prosodic rules. These rules
are the important missing blocks needed to construct an allophone-based Arabic
text-to-speech system by rules.
Perceptual Linear Prediction (PLP) is another extraction technique,
which was used by Vuuren, S.V. (1996) in his research. In the research, Vuuren, S.V.
compared the discriminability and robustness against noise of both Perceptual Linear
Prediction (PLP) and Linear Predictive Coding (LPC). Particularly for PLP, the spectral
scale is the non-linear Bark scale and the spectral features are smoothed within the
frequency bands.
The purpose of this research is to convert the speech waveform into a form in which
the Mel-Frequency Cepstral Coefficient (MFCC) technique for extracting features from
Quranic verse recitation can be explored and investigated. MFCC is perhaps the most
popular feature extraction method in recent use (Bateman, D. et al., 1992; Ehab, M. et al., 2007),
and it is one of the most popular feature extraction techniques in speech recognition,
being based on the frequency domain of the Mel scale for the human ear (Chetouani, M.
et al., 2002). MFCCs are based on the known variation of the human ear's critical
bandwidths with frequency. The speech signal is expressed in the Mel frequency scale in
order to capture the important phonetic characteristics of speech. This scale has linear
frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. A normal speech
waveform may vary from time to time depending on the physical condition of the speaker's
vocal cords. MFCCs are less susceptible to such variations than the speech waveforms
themselves.
Based on the research conducted by Ahmad, A.M. et al. (2004), the Mel scale has been
used to perform filterbank processing on the power spectrum, carried out after
the windowing and FFT steps. A similar approach was carried out by Tabbal, H. et al. (2006).
The use of MFCC has shown remarkable results in the field of speech recognition,
because it emulates the behavior of the auditory system by transforming the frequency
from a linear scale to a non-linear one.
Cepstral Coefficients (MFCCs) is been coded for recorded speech data. Pitch marks were
produced using Wavelet transform approach by using the glottal closure signal. This signal
is obtained from the professional speaker during the recording process. Khalifa, O. et al.
27
(2004), had identified the main steps for MFCCs, that clearly shown in figure 2.3 below.
1. Preprocessing
2. Framing
3. Windowing
4. DFT
5. Mel-Filterbank
6. Logarithm
7. Inverse DFT
Similarly, Hasan, M.R. et al. (2004) used MFCCs for feature extraction in a security system based on speaker identification. Here, the pitch of the speech signal is measured on the 'Mel' scale. The Mel-frequency scale formula is based on a mathematical mapping between physical frequency and perceived pitch.
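As an illustrative sketch (in Python rather than the MATLAB used in this work), the commonly cited empirical Mel mapping and its inverse can be written as follows; the constants 2595 and 700 are the standard textbook values, assumed here since the thesis's exact formula is not reproduced in this excerpt:

```python
import math

def hz_to_mel(f_hz):
    """Map a physical frequency in Hz onto the Mel scale (assumed constants)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping, from Mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The scale is roughly linear below 1000 Hz and compressive (logarithmic) above:
low_span = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_span = hz_to_mel(2000.0) - hz_to_mel(1000.0)
```

The shrinking Mel span per 1000 Hz octave above 1 kHz is exactly the perceptual compression the text describes.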
2.5.2.4 Spectrographic Analysis
There are a few Arabic speech recognition systems, which are normally speaker dependent and use different techniques such as formant values and their trends. Here, spectrograms carry richer information than simple formant values. The objective of the research of Bashir, M.S. et al. (2003) was to implement a feature extraction strategy for Arabic language phonemes, where phonemes are represented by particular distinct bands within the spectrogram, which can be identified for each phoneme of the Arabic language. Determination of the particular phoneme depends on the specific frequency band. Based on the results, speech processing using spectrograms gives more accurate results compared to other conventional techniques. However, spectrogram analysis is believed to take more time to execute.
2.5.3 Feature Classification

Feature classification is essentially pattern recognition, which is one of the most challenging problems for machines. The main objective of pattern recognition is to classify the object of interest into one of a number of categories or classes. The objects of interest are known as patterns, and in this case the classes refer to individual words. Since the classification procedure applied in this research operates on extracted features, it can also be referred to as feature matching. Pattern matching for recognition purposes is divided into three types:
2.5.3.1 Hidden Markov Model (HMM)
Nathan, K. et al. (1995) implemented HMMs for recognizing handwritten words captured from a tablet, because Hidden Markov Models (HMMs) had already been applied successfully in speech recognition systems. Moreover, in the research of Tabbal, H. et al. (2006), the output of the front-end was used to feed the Sphinx core recognizer, which uses the Hidden Markov Model (HMM) as its recognition tool. The results of the recognizer were placed in a hash map for translation into the corresponding Arabic words. An HMM generates a discrete-time random process consisting of two sequences of random variables: the hidden states and the known observations. The underlying structure of an HMM is a set of states associated with probabilities of transitions between states, known as a Markov chain (Hansen, J.C., 2003).
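The two coupled sequences of an HMM (hidden states and known observations) can be illustrated with a small generative sketch; the three-state left-to-right chain, its transition matrix and the Gaussian emission means below are hypothetical toy values, written in Python rather than the MATLAB used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state left-to-right chain (toy values, for illustration only).
A = np.array([[0.8, 0.2, 0.0],      # transition probabilities between states
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
pi0 = np.array([1.0, 0.0, 0.0])     # the chain always starts in the first state
means = np.array([0.0, 5.0, 10.0])  # Gaussian emission mean for each hidden state

def sample_hmm(T):
    """Generate the two coupled sequences: hidden states and known observations."""
    states, obs = [], []
    s = rng.choice(3, p=pi0)
    for _ in range(T):
        states.append(int(s))
        obs.append(rng.normal(means[s], 1.0))   # observable output
        s = rng.choice(3, p=A[s])               # hidden state transition
    return states, obs

states, obs = sample_hmm(20)
```

Only `obs` would be visible to a recognizer; the `states` sequence is the hidden Markov chain the text refers to.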
On the other hand, the acoustic decision trees used in synthesis are built from the HMM alignment. The HMM alignment was done by Youssef, A. & Emam, O. (2004), where acoustic, energy, pitch and duration trees were developed and executed with the efficient maximum-likelihood algorithms that exist for HMM training and recognition (Lee, K.F. & Hon, H.W., 1989).
(a) HMM Training
Training is concerned with estimating the parameters used in training HMMs. All the HMM algorithms play a crucial role in ASR (Automated Speech Recognition); they involve mapping states, transitions and observations onto the speech recognition task. Extensions to the Baum-Welch algorithm are needed to deal with spoken language. This method was implemented by Jurafsky, D. & Martin, J.H. (2007) in their research. Here, speech recognition systems train each phone HMM embedded in an entire sentence, so that segmentation and phone alignment are done automatically as part of the training procedure (Rabiner, L. & Juang, B.H., 1993). The HMM model parameters (A, B, π) need to be re-estimated during training.

(b) HMM Recognition
The Viterbi algorithm is used for decoding HMMs. An observation sequence is obtained via feature analysis of the speech, regardless of the word spoken. The word is then selected using the Viterbi algorithm as the one whose model likelihood is maximum.
2.5.3.2 Artificial Neural Network (ANN)
An Artificial Neural Network (ANN) is a computational or mathematical model based on biological neural networks. ANNs are made up of interconnected artificial neurons and may be used to model complex relationships between inputs and outputs without necessarily creating a model of a real biological system. The ANN belongs to the artificial intelligence approaches, which attempt to mechanize the recognition procedure. The procedure depends on the way a person applies intelligence in visualizing, analyzing and characterizing speech based on a set of measured acoustic features. Problems happen when two sequences are not synchronous, in which case proper alignment, segmentation and classification must be handled. Thus, basic neural networks are not well suited to such sequence tasks.
Figure 2.4: Interconnected group of nodes in ANN (from Huang, X. et al., 2001)
2.5.3.3 Vector Quantization (VQ)
Vector quantization is the process of mapping a signal into a set of discrete symbols. Quantization can operate on a single signal value or parameter (scalar quantization) or on a vector of values (vector quantization). Related to this topic, Huang, X. et al. (2001) described the vector quantizer in terms of a codebook, which is a set of fixed prototype or reproduction vectors; each prototype vector is also known as a codeword. To perform the quantization process, each input vector is matched against each codeword in the codebook using a distortion measure. Thus, the VQ process involves a distortion measure and the generation of each codeword for the particular codebook involved. The goal of VQ is to minimize the distortion (Vuuren, S.V., 1996).
Feature training is mainly concerned with randomly selecting feature vectors and training the codebook using a Vector Quantization (VQ) algorithm. The training process of the VQ codebook applies an important algorithm known as the LBG VQ algorithm, which is used for clustering a set of L training vectors into a set of M codebook vectors. This algorithm is formally implemented as a recursive procedure (Linde, Y. et al., 1980), and the following steps describe the training of the VQ codebook using the LBG algorithm.
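A minimal sketch of LBG-style codebook training (split each codeword into a perturbed pair, then refine by nearest-neighbour classification and centroid update) might look as follows; this Python illustration assumes squared Euclidean distortion and a power-of-two codebook size M, and is not the thesis's implementation:

```python
import numpy as np

def lbg_codebook(X, M, eps=0.01, iters=20):
    """Train an M-codeword VQ codebook from L training vectors X (L x dim)
    by iterative codeword splitting followed by nearest-neighbour refinement."""
    codebook = X.mean(axis=0, keepdims=True)          # start from one centroid
    while len(codebook) < M:
        # Split every codeword into a perturbed pair, then refine.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            # Distortion: squared Euclidean distance to every codeword.
            d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
            nearest = d.argmin(axis=1)                # classify each training vector
            for k in range(len(codebook)):            # recompute centroids
                if np.any(nearest == k):
                    codebook[k] = X[nearest == k].mean(axis=0)
    return codebook

# Two artificial clusters; the trained 2-word codebook should find both centers.
X = np.vstack([np.zeros((50, 2)), np.full((50, 2), 10.0)])
codebook = lbg_codebook(X, 2)
```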
Figure 2.5 shows a block diagram of a vector quantizer, which consists of two main parts, known as the encoder and the decoder. The task of the encoder is to identify in which of N geometrically specified regions the input vector lies. The decoder is then a table lookup, fully determined by specifying the codebook (Wai, C.C., 2003).
2.5.4 Recognition/Identification
There are many methods used for recognition as well as identification. Within speech recognition, the methods normally used nowadays are listed below:
As described in section 2.5.3 under feature classification, the HMM method has been fully implemented for both recognition and training purposes (Lee, K.F. & Hon, H.W., 1989). In the research of Jurafsky, D. & Martin, J.H. (2007), HMM recognition was used for a digit recognition task. A lexicon specifies the phone sequence, and each phone HMM is composed of three sub-phones, each with a Gaussian model. Combining all these elements, with an optional silence at the end of each word, results in a single HMM for the whole task. Note that the transition from the 'End state' to the 'Start state' allows digit sequences of arbitrary length.
On the other hand, recognition has also been carried out through the Viterbi algorithm (Viterbi, A.J., 1967) in a large HMM. For context-independent phone recognition, an initial and a final state are created. The initial state is connected with null arcs to the initial state of each phonetic HMM, and null arcs connect the final state of each phonetic HMM to the final state. The final state is also connected to the initial state.

In the Vector Quantization (VQ) approach, feature vectors are extracted from the speech frames of a given utterance. The process can be treated as a random process, and a speaker-specific codebook is generated by clustering the training feature vectors of each speaker, as described in part 2.5.3.3. In the recognition stage, an input utterance is vector-quantized using the codebook of each reference speaker, and the VQ distortion accumulated over the entire input utterance is used to make the recognition decision. It is believed that the VQ-based method is more robust than a continuous HMM method, as stated by Matsui, T. & Furui, S. (1993) in their research.
The Artificial Neural Network (ANN), also known simply as a Neural Network (NN), is also used for feature matching or recognition in speech processing. It is normally used to classify a set of features that represent the spectral-domain content of the speech (regions of strong energy at particular frequencies). The features are converted into phonetic-based categories at each frame, and a Viterbi search is then used to match the neural-network output scores to the target words (the words assumed to be in the input speech), in order to determine the word most likely uttered (Hosom, J.P. et al., 1999).
Table 2.4 compares Quranic Arabic recitation recognition systems using the techniques discussed in this literature. The main criterion for comparing the approaches used on Quranic Arabic is their recognition performance.
Table 2.4: Approaches used by Quranic Arabic recitation using speech recognition
techniques
References | Pre-processing Method | Feature Extraction Techniques | Classification/Recognition | Performance
[Tabbal, H. et al. '06] | Pre-emphasis filter | MFCC | Hidden Markov Model (HMM) | 85% - 92%
[Youssef, A. & Emam, O. '04] | - | MFCC | Hidden Markov Model (HMM) | 90.2%
[Ahmad, A.M. et al. '04] | Digital filtering | MFCC, LPCC | Recurrent Neural Network (RNN) | MFCC 95.9% - 98.6%; LPCC 94.5% - 99.3%
[Bashir, M.S. et al. '03] | Preemphasis filtering (bandpass filter) | Spectrographic analysis | Spectrographic analysis based on different frequency bands of intensity | 93.33%
[Kirchoff, K. et al. '04] | Kneser-Ney smoothing | Not stated | Hidden Markov Model (HMM) | Not stated
[Hasan, M.R. et al. '04] | - | MFCC | Vector Quantization (VQ) | 57% - 100%
[Podder, S.K. '97] | - | LPC | VQ and HMM | 62% - 96%
[Bhotto, M.Z.A. & Amin, M.R. '04] | - | MFCC | Vector Quantization (VQ) | 70% - 85%
2.7 Summary
In this study, the different methods and approaches have been discussed in order to find the most suitable ones for this project, and the methods that can logically be used have been decided. The MFCC method was chosen for feature extraction, because it builds on the DFT and FFT algorithms and because the majority of researchers have used MFCCs as their main features for extraction purposes.

On the other hand, the training as well as the recognition part can be conducted using HMM, ANN or VQ. These three methods are commonly used for speech recognition and are the most dominant pattern recognition techniques in the field. Moreover, these methods have shown comparably strong performance, in different ways and under different expectations, and each has its own benefits and weaknesses. From this point of view, HMM appears to be the most suitable method, as few of the surveyed studies reported it with a low percentage of accuracy. This contrasts with the VQ algorithm, which is mostly used by researchers in speech recognition projects related to the English language. In addition, the combination of the MFCC and HMM techniques has mostly been implemented for speech recognition applications, especially for the Arabic language, as shown in table 2.4. Besides, those techniques have been proven suitable to be applied in this research, since the reported performance percentages were above 90%.
CHAPTER 3
RESEARCH METHODOLOGY
3.1 Introduction
Each algorithm in a speech recognition system performs a specific task toward achieving the main goal of the system. Therefore, a combination of related algorithms is essential in order to improve the accuracy and recognition rate of such applications. This chapter highlights the research methodology for the development of the automated Tajweed checking rules engine for Quranic verse recitation, stressing mainly the techniques and algorithms used for its development.

Here, the main algorithm used is the feature extraction technique of Mel-Frequency Cepstral Coefficients (MFCC), applied to the phonemes of the Quranic recitation. Besides, this engine also implements the Hidden Markov Model (HMM) for training and recognition purposes.
3.2 Tajweed checking rules engine techniques and algorithms
In this section, feature extraction using MFCC and recognition using the Hidden Markov Model (HMM) are highlighted. In this technique, the feature vectors of the speech are extracted, and the recognition result depends on the log likelihood of each word in the vocabulary; the word with the largest log likelihood is decided as the recognition result. Since different people can give different pronunciations even for the same sentence, HMM classification is used to improve the accuracy of recognition.

In this chapter, the techniques and algorithms involved in this research are discussed in detail. First, the input is filtered to get rid of noise, and the features are extracted using Mel-Frequency Cepstral Coefficients (MFCC) in order to capture the important characteristics of the speech signal, represented as a set of feature vectors. Then, the whole sentence can be estimated and classification can be made; the pattern classification method used here is the Hidden Markov Model (HMM). The entire process in this research is shown below:
Stage 1: Training
Begin
Step 2 : Preemphasis is executed using a Finite Impulse Response (FIR) filter
Step 7 : The HMM model λ = (A, pi0, mu, sigma) is evaluated and stored in the database
End
Stage 2: Testing/Recognition
Begin
Step 7 : The HMM model λ = (A, pi0, mu, sigma) is evaluated
Step 8 : The observation sequence and HMM values obtained from the test input are compared with all models present in the database, through the Viterbi algorithm
Step 9 : The recognition result is decided based on the maximum log likelihood of the test data matched against the trained data
End
3.2.1 Speech Samples Collection (Speech Recording)
In this part, the recording process is executed in order to collect speech samples of Quranic recitation from different speakers. According to Rabiner and Juang (1993), there are four main factors that need to be considered while collecting the speech samples. These factors need to be identified before any recording is executed, because they affect the performance and the output result, especially the training set vectors that will be used in the training and testing process. In this project, the automated Tajweed checking rules engine uses a simple MATLAB function for recording the speech samples; figure 3.1 below shows the MATLAB code used for recording.
However, this function requires the user to define certain parameters before the recording process is carried out. Those parameters include the sampling rate (Hz) as well as the time length in seconds. Here, the MATLAB command "wavrecord" is used to read the audio signals directly from the microphone. The command format is:

y = wavrecord(n, fs);

where "n" is the number of samples to be recorded and "fs" is the sampling rate. In this recording part, the recording process takes 4 seconds, recorded using a normal microphone. "Duration*fs" is the number of sample points to be recorded. The recorded sample points are stored in the variable "y", a vector of size 64000x1.
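The sample-count arithmetic behind this recording setup can be sketched as follows; this is a Python stand-in with a zero placeholder signal, since no audio hardware is invoked here:

```python
import numpy as np

fs = 16_000           # sampling rate in Hz, as stated in the text
duration = 4          # time length of the recording in seconds
n = duration * fs     # number of sample points: 4 * 16000 = 64000

# Placeholder column vector standing in for the recorded signal; in MATLAB
# this is what wavrecord fills, and any audio library could do the same.
y = np.zeros((n, 1))
```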
3.2.2 Feature Extraction

In the late 1970s, coefficients derived from the cepstrum began to replace Linear Prediction Coefficients (LPC) as the basic algorithm and parameter set for speech recognition, and the cepstral approach is widely used nowadays for feature extraction in speech processing. In this technique, the use of the Mel scale in the derivation of the cepstrum coefficients was introduced. The Mel scale is a mapping of the linear frequency scale based on human auditory perception (Levent, M.A., 1996).
As mentioned earlier, the main objective of feature extraction is to extract the important characteristics from the speech signal that are unique to each word, in order to differentiate between a wide set of distinct words. According to Ursin, M. (2002), MFCC is considered the standard method for feature extraction in speech recognition and is perhaps the most popular feature extraction technique used nowadays. MFCC is able to obtain better recognition accuracy compared to other feature extraction techniques (Davis, S.B. & Mermelstein, P., 1980).
[Figure 3.2: Block diagram of the proposed feature extraction method, ending with the DCT]
The proposed method for feature extraction is given in figure 3.2 above. At this stage, the emphasis is on the MFCC computational process as the main algorithm for feature extraction analysis. Here, the MFCC feature extraction algorithm was applied to all collected speech samples to obtain the targeted feature vectors as output. Certain parameters need to be defined before the MFCC algorithm is run and the coefficient values are estimated. Table 3.1 shows the parameter values as well as the MFCC filter equations used in the MFCC MATLAB code.
Table 3.1: MFCC Parameter Definition
Parameter | Value
Number of filters | 40
The voice input is recorded using a normal microphone and a sound recorder utility. In this engine, the speech is sampled at 16,000 Hz for a time length of 4 seconds, with a sampling precision of 16 bits. In the preprocessing stage, the array of the speech signal is obtained from the microphone after the recording process. The time graph and spectrum of the speech signal are calculated and displayed as a time graph and a spectrum graph in a plot figure. Figure 3.3 below shows the resulting time and spectrum graphs.
Figure 3.3: Time and Spectrum graph for the recitation “Bismillahi Al-Rahmani Al-
Rahim”
3.2.2.1 Preemphasis
Preemphasis is considered the first step of MFCC, under the preprocessing stage in speech processing, which involves the conversion of the signal from analog to digital form. The sequence of samples x[n] is obtained from the continuous-time signal x(t) by periodic sampling:

x[n] = x(nT)

where T is the sampling period and 1/T = fs is the sampling frequency in samples/sec, and n is the sample index. The size of the digital signal is determined by the sampling frequency and the length of the speech signal in seconds.

In the first stage of MFCC feature extraction, preemphasis is used to boost the amount of energy in the high frequencies. It can be seen in the spectrum of speech segments like vowels that there is more energy at the lower frequencies than at the higher frequencies; this drop of energy across frequencies is caused by the nature of the glottal pulse (Jurafsky, D. & Martin, J.H., 2007). Preemphasis therefore increases the energy of the signal as the frequency increases. Preemphasis is executed after the digitization of the speech signal through a first-order FIR filter:

H(z) = 1 - α·z^(-1)    (3)

where α is the preemphasis parameter, set to a value close to 1, in this case 0.97. Applying this FIR filter to the speech signal relates the preemphasized output to the input by y[n] = x[n] - α·x[n-1].
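The first-order FIR preemphasis filter of equation (3) can be sketched as follows; this is a Python illustration rather than the thesis's MATLAB code, and passing the first sample through unchanged is an assumption:

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order FIR preemphasis: y[n] = x[n] - alpha * x[n-1],
    i.e. the filter H(z) = 1 - alpha * z^(-1) of equation (3).
    Treating the first output sample as x[0] is an assumption."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                       # no previous sample at n = 0
    y[1:] = x[1:] - alpha * x[:-1]
    return y

y = preemphasis(np.array([1.0, 1.0, 1.0]))  # -> approximately [1.0, 0.03, 0.03]
```

A constant (low-frequency) input is almost cancelled, which is exactly the high-frequency boost described above.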
3.2.2.2 Framing
After the preemphasis filtering is executed, the filtered input speech is framed. Here, the columns of data from the particular speech input are determined. The Fourier transform used here is only reliable when the signal is stationary, and speech can be considered stationary only within short time intervals of less than 100 milliseconds. Thus, the speech signal is decomposed into a series of short segments; each frame is analyzed and any useful features are extracted from it. A window size of 256 points per frame is chosen in this research, which corresponds to 16 ms at the 16 kHz sampling rate.
[Figure: framing of the speech signal (amplitude versus samples)]
3.2.2.3 Windowing
Windowing is one of the important parts of the MFCC feature extraction process. Here, each individual frame of the speech signal is windowed in order to minimize the signal discontinuities at the beginning and at the end of each frame. The purpose of this action is to minimize the spectral distortion and to taper the signal to zero at the beginning and end of each frame. N is the number of samples in each frame, and the windowed signal y(n) is the product of the frame and the window function w(n).

The Hamming window w(n) used in this work is given by equation (6) below:

w(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
w(n) = 0,  otherwise    (6)
Figure 3.7 shows the MATLAB code for performing the windowing of the segmented speech samples, whereas figure 3.8 shows the resulting Hamming window graph. The effect of windowing a speech sample can be visualized clearly in figure 3.9: the transitions of the speech sample are smooth towards the edges of the frame.
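Framing with overlap followed by Hamming windowing, as described above, can be sketched as follows; this Python illustration assumes a 256-point frame and roughly 60% overlap, per the values given in the text:

```python
import numpy as np

def frame_and_window(signal, frame_len=256, overlap=0.6):
    """Split the signal into frames of frame_len samples with ~60% overlap,
    then apply the Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))."""
    step = int(frame_len * (1 - overlap))            # hop size between frames
    n_frames = 1 + (len(signal) - frame_len) // step
    n = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # equation (6)
    return np.stack([signal[i * step : i * step + frame_len] * w
                     for i in range(n_frames)])

frames = frame_and_window(np.arange(64000, dtype=float))
```

Each frame is tapered towards its edges (the window endpoints equal 0.08), which is the smoothing effect the text attributes to figure 3.9.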
[Figure 3.8: Hamming window (amplitude versus samples, 0-256)]

[Figure 3.9: windowed speech signal (amplitude versus samples)]
Once the speech sample data is framed and windowed, the data at the ends of each frame is likely to be reduced to zero, resulting in a loss of information. Thus, overlapping between frames is executed: adjacent frames include a portion of each other's data, so that the edges of the current frame fall near the centers of the adjacent frames. Normally, around 60% overlap is sufficient to recover the lost information and also helps to smooth the varying parameters. The Fast Fourier Transform (FFT) is then applied to the windowed speech samples, converting each frame of N samples from the time domain into the frequency domain.
3.2.2.4 Fast Fourier Transform (FFT)

According to Owen, F.J. (1993), the Discrete Fourier Transform (DFT) is normally computed via the Fast Fourier Transform (FFT) algorithm. This algorithm is widely used for evaluating the frequency spectrum of speech and converts each frame of N samples from the time domain into the frequency domain. The DFT is defined on the set of N samples x(k) as:

X(n) = Σ_{k=0}^{N-1} x(k)·e^(-j2πkn/N),  n = 0, 1, 2, ..., N - 1    (7)

In this research, the windowed speech segment is transformed into the frequency domain by using the Fourier transform through the MATLAB command shown in figure 3.10.
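The DFT of equation (7), computed via the FFT, can be illustrated with a synthetic windowed frame; this Python sketch uses a 1 kHz sine at the 16 kHz sampling rate, so the spectral peak should land at bin 1000 / (16000/256) = 16:

```python
import numpy as np

fs, N = 16000, 256                    # sampling rate and frame length from the text
t = np.arange(N) / fs
frame = np.sin(2 * np.pi * 1000 * t) * np.hamming(N)  # windowed 1 kHz test tone

spectrum = np.fft.fft(frame)                  # equation (7), computed via the FFT
power = np.abs(spectrum[: N // 2 + 1]) ** 2   # one-sided power spectrum
peak_bin = int(np.argmax(power))
peak_hz = peak_bin * fs / N                   # bin spacing is fs / N = 62.5 Hz
```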
3.2.2.5 Mel Filterbank
The Mel scale is applied in order to place more emphasis on the low-frequency components of the speech signal, which are perceptually more important than the high-frequency components. The Mel is a unit of a special measure or scale of the perceived pitch of a tone. The Mel filterbank is also known as Mel frequency warping; it does not correspond linearly to the physical frequency, but behaves linearly below 1000 Hz and logarithmically above 1000 Hz. The approximate empirical relationship commonly used to compute the Mels for a frequency f in Hz is mel(f) = 2595·log10(1 + f/700). The Fourier transform of each speech segment is then binned by correlating it with triangular filters in the frequency domain.
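A triangular Mel filterbank of the kind described above can be constructed as follows; this Python sketch assumes the standard 2595·log10(1 + f/700) Mel mapping and the 40-filter, 256-point, 16 kHz configuration given earlier in the chapter, and is not the thesis's MATLAB implementation:

```python
import numpy as np

def mel_filterbank(n_filters=40, n_fft=256, fs=16000):
    """Build triangular filters spaced uniformly on the Mel scale
    (linear below ~1 kHz, logarithmic above), for Mel frequency warping."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_filters triangles need n_filters + 2 equally spaced Mel edge points.
    mels = np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):            # rising edge of the triangle
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling edge of the triangle
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
```

Multiplying the one-sided power spectrum by `fb.T` yields the 40 filterbank energies, which is the "binning by correlation with triangular filters" the text refers to.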
In this part, the cepstral coefficients of the Mel-Frequency Cepstral Coefficients (MFCC) corresponding to the input are obtained. The output results can be seen in the figure below.
[Figure: Mel cepstrum coefficients (amplitude versus samples)]
The Discrete Cosine Transform (DCT) is similar to the Discrete Fourier Transform (DFT), but uses real numbers only. The DCT is used to extract the Mel-Frequency Cepstral Coefficient (MFCC) results, and it is often used to calculate the cepstrum instead of the inverse FFT.
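The logarithm-then-DCT step can be sketched as follows; the filterbank energies here are randomly generated stand-ins, the DCT-II is written out directly from its definition, and the choice of 13 retained coefficients is a common convention assumed purely for illustration:

```python
import numpy as np

def dct2(x):
    """Type-II DCT written out from its definition (real numbers only)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(N)])

rng = np.random.default_rng(1)
mel_energies = rng.uniform(0.1, 10.0, size=40)   # stand-in filterbank outputs
log_energies = np.log(mel_energies)              # the logarithm step
mfcc = dct2(log_energies)[:13]                   # keep the first 13 coefficients
```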
In this research, this part is the final step of computing the MFCCs. It requires computing the logarithm of the magnitude spectrum in order to obtain the Mel-Frequency Cepstral Coefficients. At this stage the MFCCs are ready to be formed into a vector format known as the feature vector. This feature vector is then the input for the next process, which is concerned with training the feature vectors for recognition purposes.
Figure 3.13: The MFCC cepstral coefficients for ayates ‘Maaliki yawmid diini’
3.2.3 Hidden Markov Model Classification
The HMM works by characterizing the spectral properties of the frames of a certain pattern. Using the HMM, the input speech signal is well characterized as a parametric random process, and the parameters of this process can be determined in a well-defined manner. The parameters of the HMM model need to be updated regularly in order to make the system fit the sequences of a particular application. Thus, training of the HMM model is important for representing the utterances of words. The model is used later in testing utterances and calculating the probability of the HMM model against the sequence of vectors.

The model is called hidden because it consists of a doubly embedded stochastic process with an underlying process that is not directly observable (hidden), but can be observed only through another set of stochastic processes that produce the sequence of observations (Rabiner, L.R. & Juang, B.H., 2003).
In this research, an HMM with multivariate Gaussian state-conditional distributions has been used. The HMM is characterized by the following:

N : the number of states.
pi0 (π0) : a row vector containing the probability distribution for the first state.
A : the state transition probability matrix.
mu (μ) : the mean vectors.
sigma (Σ) : the covariance matrices. These values can be stored in two different ways (full or diagonal).
Figure 3.14 below depicts the automated Tajweed checking rules system structure, which illustrates a speaker recognition system for Quranic verse recitation. There are two main stages in a speech recognition system: the training and recognition stages. In the training stage, models (patterns) are generated from the input speech samples after the feature extraction process and modeling techniques. Meanwhile, in the recognition stage, feature vectors are generated from the input speech samples using the same extraction procedure as in the training stage; after that, the classification process as well as the decision process is executed with matching techniques. Under the classification type, the recognition task can be divided into either identification or verification.
Figure 3.14: Automated Tajweed Checking Rules system structure (λ = model parameter)
Moreover, a distinct HMM is used to model each word of the vocabulary. Each word in the vocabulary has a training set of k utterances by different speakers (Rabiner, L. & Juang, B.H., 1993), and each utterance constitutes an observation sequence of MFCCs. Isolated-word speech recognition for the automated Tajweed checking rules engine involves three major steps:

(1) Training/Modeling: for each word in the vocabulary, an HMM model is built and its parameters are estimated from the training observations.

(2) Identification: an observation sequence is obtained via feature analysis of the speech, regardless of the word. Lastly, the word is selected using the Viterbi algorithm as the one whose model likelihood is maximum.

(3) Verification: the input features are compared with the registered patterns, and the features giving the highest score identify the selected/target speaker (recitor) and recitation results. Then, these input features are compared with those of the claimed speaker (recitor), and a decision is made either to accept or to reject the claim/results.
According to these three major steps, the training/modeling step is executed during HMM training, while the identification and verification steps are carried out during testing/recognition.
The training of the Hidden Markov Model is used to model and represent particular utterances of words or phonemes from the Quranic recitation. Thus, a complete specification of the HMM requires the two model parameters N and p, as well as the three sets of probability measures A, mu and sigma and the initial state distribution pi0. According to Hemantha, G.K. et al. (2006), the complete parameter set of an HMM model is denoted by λ = (A, B, pi0), but in this research B is represented by the two sets mu and sigma, giving λ = (A, pi0, mu, sigma), chosen to maximize P(O|λ). The values obtained for the λ model are stored in the database for further processing in the testing/recognition part in stage 2.
(a) Initialization
A: the state transition probability matrix, using the Left-to-Right model. The state transition probability matrix A is initialized with equal probability for each state, and it can be made sparse to save memory (A should be upper triangular for a left-to-right model).
The values of A were obtained after the MATLAB simulations were successfully executed. The A values are initialized with equal probability for each state, as shown below in figure 3.16.
Figure 3.16: The state transition probability matrix (A) for ayates ‘Maaliki yawmid diini’
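One plausible way to initialize A and pi0 for a left-to-right model, consistent with the description above (upper-triangular A with equal probabilities per row, pi0 deterministic in state 1), is sketched below in Python; the exact initialization used in the thesis's MATLAB code may differ:

```python
import numpy as np

def init_left_to_right(n_states):
    """Upper-triangular A with equal probability over the reachable states in
    each row, and a deterministic pi0 = [1 0 ... 0], per the left-to-right model."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i:] = 1.0 / (n_states - i)   # equal probability, no backward jumps
    pi0 = np.zeros(n_states)
    pi0[0] = 1.0                          # always begin in state 1
    return A, pi0

A, pi0 = init_left_to_right(14)           # 14 states, as for 'Maaliki yawmid diini'
```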
pi0: the initial state probability distribution is initialized for the left-to-right model: it is deterministic, with the model in state 1 at the beginning (i.e. pi0 = [1 0 ... 0]). For the ayates 'Maaliki yawmid diini' this gives:

pi0(i) = [1 0 0 0 0 0 0 0 0 0 0 0 0 0],  1 ≤ i ≤ N

so the number of states N is 14 in this case.
Initialize the mean vectors (mu (μ)) and covariance matrices (sigma (Σ)) for the left-to-right Hidden Markov Model (HMM). These values determine the dimensions of the model (the size of the observation vector and the number of states) and the type of covariance matrices. Figure 3.17 shows the MATLAB code for initializing the model parameters mu (µ) and sigma (Σ), using multiple observations for the left-to-right Hidden Markov Model (HMM). Here, each parameter sequence of speech is chopped into N segments of equal length, one segment per state.

Figure 3.17: MATLAB code for initializing the model (mu, sigma)

It is believed that most functions (with mu and sigma as their input arguments) are able to determine the dimensions of the HMM model (the size of the observation sequence and the number of states N) and the type of covariance matrices (either full or diagonal) from their input
arguments. This can be checked through the function hmm_chk. Below are the model parameter values of mu (µ) and sigma (Σ) for the ayates 'Maaliki yawmid diini'.
Figure 3.19: The mean vectors mu (µ) for ayates ‘Maaliki yawmid diini’
Figure 3.20: The covariance matrices sigma (Σ) for ayates ‘Maaliki yawmid diini’
(b) Training. The training computations for a left-to-right model are performed with multiple training sequences. This process is just a call to the lower-level functions, where the values of A, mu and sigma supplied from part 3.2.3.1(a) are also used as the initialization values (A_, mu_, sigma_).
These values are used and implemented in the next process, where the forward-backward recursions (with scaling) are executed. Figure 3.21 shows the MATLAB code for this step. In this code, alpha is the forward variable and beta is the backward variable, with log1 holding the log-likelihood values. Notice that at each step the log-likelihood is computed from the forward variables using the log1 term returned by hmm_fb (forward-backward), which is the sum of the logarithmic scaling factors used during the computation of alpha and beta. Another variable, dens, contains the values of the Gaussian densities for each time index (useful for estimating the transition probabilities).
Evaluation of the model λ = (A, pi0, mu, sigma) is carried out by finding which model most likely produced the observation sequence. Thus, every possible sequence of states of length T can be evaluated through:

P(O | λ) = Σ_{q1,q2,...,qT} π_{q1} b_{q1}(o_1) Π_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)    (14)
Based on equation (14), initially (at time t = 1) the process is in state q1 with probability π_{q1} and generates the symbol o1 with probability b_{q1}(o1). The clock then moves to t = 2, the process transitions from q1 to q2 with probability a_{q1 q2} and generates o2 with probability b_{q2}(o2). The process continues in this manner until the last transition is made (at time T): a transition from q_{T-1} to q_T occurs with probability a_{q_{T-1} q_T}, and the symbol o_T is generated with probability b_{q_T}(o_T).

The forward variable α_t(i) is defined as the probability of the partial observation sequence from the first observation until time t, together with state i at time t, given the model. It can be computed as follows:
1. Initialization
   Set t = 1;
   α_1(i) = π_i · b_i(o_1),  1 ≤ i ≤ N
   This is the joint probability of starting in state i and generating the symbol o1. Only α_1(1) will have a nonzero value, since pi0 = [1 0 ... 0].

2. Induction
   α_{t+1}(j) = b_j(o_{t+1}) · Σ_{i=1}^{N} α_t(i) · a_{ij},  1 ≤ j ≤ N

3. Update time
   Set t = t + 1;
   Return to step 2 if t ≤ T; otherwise, terminate.

4. Termination
   P(O | λ) = Σ_{i=1}^{N} α_T(i)
In the scaled implementation, every row of the alpha matrix sums to 1 except the first one, as shown by the scaled alpha matrix for the ayates 'Maaliki yawmid diini', where 1 ≤ T ≤ the number of input frames.
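The scaled forward recursion described above can be sketched as follows; this Python illustration (not the thesis's MATLAB) precomputes the state likelihoods b_i(o_t) into a matrix B and returns the log-likelihood as the sum of the logarithmic scaling factors, mirroring the log1 term mentioned earlier (in this sketch every scaled row, including the first, sums to 1):

```python
import numpy as np

def forward_scaled(pi0, A, B):
    """Scaled forward recursion for an HMM. B[t, i] holds the precomputed
    state likelihood b_i(o_t); returns the scaled alpha matrix and the
    log-likelihood log P(O|lambda) as the sum of log scaling factors."""
    T, N = B.shape
    alpha = np.zeros((T, N))
    scale = np.zeros(T)
    alpha[0] = pi0 * B[0]                      # initialization (t = 1)
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):                      # induction
        alpha[t] = B[t] * (alpha[t - 1] @ A)
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]                   # every scaled row sums to 1
    return alpha, np.log(scale).sum()          # termination (in the log domain)
```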
Considering the observation sequence in the reverse way, β_t(i) is the backward variable. From equation (16), β_t(i) is the probability, given state i at time t and the model, of the partial observation sequence from time t + 1 until observation number T, o_{t+1} o_{t+2} ... o_T, having been generated. The variable can be computed as follows:

1. Initialization
   Set t = T - 1;
   β_T(i) = 1,  1 ≤ i ≤ N

2. Induction
   β_t(i) = Σ_{j=1}^{N} a_{ij} · b_j(o_{t+1}) · β_{t+1}(j),  1 ≤ i ≤ N

3. Update time
   Set t = t - 1;
   Return to step 2 if t ≥ 0; otherwise, terminate the algorithm.
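The matching scaled backward recursion, reusing the forward scaling factors, can be sketched as follows (a Python illustration, not the thesis's MATLAB; the scaling convention is one common choice):

```python
import numpy as np

def backward_scaled(A, B, scale):
    """Scaled backward recursion, reusing the scaling factors produced by
    the forward pass. B[t, j] = b_j(o_t); beta[T-1] is initialized to one
    (divided by its scale factor here) and the induction
    beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j) runs in reverse."""
    T, N = B.shape
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0 / scale[T - 1]           # scaled initialization
    for t in range(T - 2, -1, -1):             # induction, t = T-2 ... 0
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / scale[t]
    return beta
```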
The backward variables are scaled using the same normalization factors as the forward variables, so the beta matrix can be inspected to see the adjustment of the re-estimation sequence. In this case, the log of the sum of the scaling factors (log(scale)) gives the total probability used for every iteration. The current value of log1 is compared with the log1 from the previous iteration; if the difference is less than a threshold, the iterations stop.
(c) Re-Estimation
The recommended algorithm for re-estimating the parameters of the model, λ = (A, pi0, mu,
sigma), is the iterative Baum-Welch algorithm. This algorithm is responsible for maximizing
the likelihood function of the model λ = (A, pi0, mu, sigma). Here, at every iteration the
Baum-Welch algorithm re-estimates the HMM parameters using the forward algorithm and the
backward algorithm, which have been implemented before.
As mentioned earlier in part 3.3.3.1(b), the values of A, mu and sigma were also used as the
initialization values (A_, mu_, sigma_). Those values are used to re-estimate the transition
parameters for the multiple-observation-sequence left-to-right HMM. However, before the
process is carried out, the dimensions of the HMM model need to be checked and determined
first, through the hmm_chk function as discussed before. Then, the re-estimation proceeds.
In this case, the matrix X contains all the observation sequences, while the vector st yields
the indices corresponding to the beginning of each sequence. Thus, X(1:st(2)-1, :) contains
the vectors that relate to the first observation sequence, up to X(st(length(st)):end, :),
which corresponds to the last observation sequence. Meanwhile, the posterior distributions of
the states are returned in gamma (γ). On the other hand, note that mix_par has also been used
for re-estimating the HMM parameters (mu_ and sigma_) from the posterior state distributions,
as shown below:
λ* = arg max_λ [P(O | λ)]   (17)
Here, the re-estimation of matrix A is quite extensive, due to the use of the forward and
backward variables accumulated over the T − 1 transitions of each observation sequence.
A new mean value, x_mu(m, n), is used for the next iteration of the process, computed as:

μ̄_{jk} = [ Σ_{t=1}^{T} γ_t(j, k) o_t ] / [ Σ_{t=1}^{T} γ_t(j, k) ]   (18)

A new covariance value, x_sigma(m, n), is calculated and used for the next iteration:

Σ̄_{jk} = [ Σ_{t=1}^{T} γ_t(j, k) (o_t − μ̄_{jk})(o_t − μ̄_{jk})′ ] / [ Σ_{t=1}^{T} γ_t(j, k) ]   (19)
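Equations (18) and (19) amount to posterior-weighted means and covariances. A NumPy sketch, collapsing the mixture index k into a single Gaussian per state for brevity (shapes and names are illustrative, not the thesis's MATLAB code):

```python
import numpy as np

def reestimate_gaussian(O, gamma):
    """Re-estimate per-state Gaussian mean and covariance from posteriors.

    O:     (T, d) observation sequence
    gamma: (T, N) posterior state occupancies gamma_t(j)
    Returns mu_ of shape (N, d) and sigma_ of shape (N, d, d), per Eqs. (18)-(19).
    """
    T, d = O.shape
    N = gamma.shape[1]
    w = gamma.sum(axis=0)                       # denominator: sum_t gamma_t(j)
    mu_ = (gamma.T @ O) / w[:, None]            # Eq. (18): posterior-weighted mean
    sigma_ = np.zeros((N, d, d))
    for j in range(N):
        diff = O - mu_[j]                       # (o_t - mu_j) for every frame
        sigma_[j] = (gamma[:, j, None] * diff).T @ diff / w[j]   # Eq. (19)
    return mu_, sigma_

rng = np.random.default_rng(0)
O = rng.normal(size=(50, 2))                    # 50 frames, 2-dim features (toy data)
gamma = rng.random((50, 3))
gamma /= gamma.sum(axis=1, keepdims=True)       # each row is a valid posterior
mu_, sigma_ = reestimate_gaussian(O, gamma)
```

In the full Baum-Welch iteration these updates alternate with the forward-backward pass until the change in log1 falls below the convergence threshold.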
(d) Result – Model of Hidden Markov Model (HMM)
Lastly, after the re-estimation process has been successfully executed, the HMM model for the
specific utterance needs to be saved. The model developed represents the specific observation
sequences (i.e., an isolated word), and it is used for recognition purposes later on. The HMM
model obtained will be discussed in detail in chapter 5. The model is presented with the
specific denotation λ = (A_, mu_, sigma_) as a MATLAB MAT-file (7×14 matrices), but here only
half of it is shown in figure 3.23 (‘Maaliki yawmiddiini’):
Figure 3.23(b): MAT-file trained model of mu_ (μ) values (State, i=1-13)
Figure 3.23(c): MAT-file trained model of sigma_ (Σ) values (State, i=1-13)
Decoding or aligning the acoustic feature sequence requires the prior specification of the
parameters of the particular HMM. As mentioned earlier, the HMM models play the role of
stochastic templates for comparing the observations. Those templates consist of several
sentences, which represent different phonemes of Quranic recitation. Each of the templates
can be determined and identified through the estimation of the HMM parameters, specified by λ.
Based on basic HMM concepts, the parameter set λ defines a probability measure for the
observation sequence given the model. In order to maximize P(q | O, λ), the suitable algorithm
to use is the Viterbi algorithm (Rabiner, L.R., 1989). The Viterbi algorithm is used to find
the best single state sequence for the given observation sequence (Rabiner, L.R. & Juang,
B.H., 1993). The testing process was carried out such that the tested utterances are compared
with each model, and a score value is obtained after each comparison is executed.
In this case, the observation sequences O are not involved in the calculation directly;
rather, the MFCC feature analysis of the speech samples corresponds to the word. For example,
a reasonable measure of the similarity of two HMM models λ1 and λ2 uses the concept of
logarithmic distance, defining the distance measure D(λ1, λ2) between the two models. Where
O(2) = (o_1, o_2, ..., o_T) is a sequence of observations generated by model λ2, the
expression is a measure of how well model λ1 matches the observations generated by model λ2.
Under the same concepts mentioned above, equation (20) has been implemented in the current
research application, mainly in recognizing the Tajweed rules based on certain ayat of Quranic
recitation. Here, the log-likelihood of the word/phoneme itself acts as the measurement. The
standard Log-Likelihood Ratio (LLR) is calculated as follows:

LLR = [log P(best | O) − log P(2nd best | O)] / N

Here, N is the length of the input utterance, log P(best | O) is the largest log-likelihood
and log P(2nd best | O) is the second largest log-likelihood. The HMM testing is done in such
a manner that the particular utterance to be tested is compared with each model, and an output
score result is defined for each comparison. The sequence for testing the Quranic recitation
is as follows:
(a) Initialization
(i) Log (A): State transition probability matrix of the model (Refer to
HMM training)
Load the A_ values (MAT-file) from the trained model λ and calculate the logarithm of A. Since
A contains zero entries, the zero components would turn into minus infinity. To avoid this
problem, the MATLAB ‘realmin’ value (the smallest positive normalized number) can be used in
place of zero before taking the logarithm, as can be shown with the trained model values.
(ii) mu (μ): Mean matrix from the model (Refer to HMM training)
Load the mu_ (μ) values (MAT-file) from the trained model λ.
(iii) Sigma (Σ): Variance matrix from the model (Refer to HMM training)
Load the sigma_ (Σ) values (MAT-file) from the trained model λ.
(iv) Log (pi0): Initial state probability vector (Refer to HMM training)
The problem here is similar to that of Log (A); thus, a small number is added, as discussed in
detail in part 3.2.3.2(a)(i). Note that the value of π is the same for each model.
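The floor-before-log trick from items (i) and (iv) can be sketched as follows; `np.finfo(float).tiny` plays the role of MATLAB's `realmin`, and the transition matrix here is an illustrative left-to-right example, not the trained model:

```python
import numpy as np

# MATLAB's realmin is the smallest positive normalized double;
# np.finfo gives the equivalent value in NumPy.
realmin = np.finfo(np.float64).tiny

A_ = np.array([[0.7, 0.3, 0.0],
               [0.0, 0.6, 0.4],
               [0.0, 0.0, 1.0]])                 # toy left-to-right transition matrix

# Zeros are floored at realmin first, so log() yields a very large
# negative number instead of -inf, keeping later max/argmax well defined.
logA = np.log(np.maximum(A_, realmin))
```

The same substitution applies to pi0 before taking its logarithm, so the Viterbi recursion never has to handle infinities.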
(i) Log (P*): The probability of the most likely state sequence; the max argument is taken at
the last state. Here, log1 has been used to represent Log P*.
(ii) plog1: The state that gives the largest Log (P*) at time T is calculated; later,
backtracking is used.
(iii) Path: The backtracked state sequence; the optimal state sequence is calculated.
(iv) Log (B): Computes the probability density values b_i(o_t) for each state i, as in the
previous chapter (HMM training). Here, dens has been used to represent it.
(v) Delta (δ): Maximization over a single path requires the quantity δ_t(i), the probability
of observing o_1 o_2 o_3 ... o_t along the best path that ends in state i at time t, for a
given model.
(vi) Psi (ψ): The optimal state sequence is retrieved and saved in a vector ψ_t(j), which
records the state that maximized the induction step at each time.
The ayat and phonemes of the Quranic recitation are recognized after comparison against the
testing models with the help of the Viterbi algorithm. This algorithm is used to find the
single best state sequence for the given observation sequence (Rabiner, L.R. & Juang, B.H.,
1993). The following steps for finding the best state sequence are included in the algorithm:
1. Preprocessing
   π̃_i = log(π_i),   1 ≤ i ≤ N
   ã_{ij} = log(a_{ij}),   1 ≤ i, j ≤ N
2. Initialization
   Set t = 2;
   b̃_i(o_1) = log(b_i(o_1)),   1 ≤ i ≤ N
   δ̃_1(i) = π̃_i + b̃_i(o_1),   1 ≤ i ≤ N
3. Induction
   b̃_j(o_t) = log(b_j(o_t)),   1 ≤ j ≤ N
   δ̃_t(j) = b̃_j(o_t) + max_{1≤i≤N} [δ̃_{t−1}(i) + ã_{ij}],   1 ≤ j ≤ N
   ψ_t(j) = arg max_{1≤i≤N} [δ̃_{t−1}(i) + ã_{ij}],   1 ≤ j ≤ N
4. Update time
   Set t = t + 1;
   Return to step 3 if t ≤ T;
5. Termination
   P̃* = max_{1≤i≤N} [δ̃_T(i)]
   q*_T = arg max_{1≤i≤N} [δ̃_T(i)]
a. Initialization
   Set t = T − 1;
b. Backtracking
   q*_t = ψ_{t+1}(q*_{t+1})
c. Update time
   Set t = t − 1;
   Return to step b if t ≥ 1;
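The preprocessing, induction, termination and backtracking steps combine into the following log-domain NumPy sketch (illustrative names and numbers; the tiny 1e-300 entries stand in for the realmin floor on zero probabilities):

```python
import numpy as np

def viterbi_log(log_pi, log_A, log_B):
    """Log-domain Viterbi: returns the best state path and its log probability P*."""
    T, N = log_B.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[0]                  # initialization
    for t in range(1, T):                         # induction
        cand = delta[t - 1][:, None] + log_A      # cand[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = cand.argmax(axis=0)              # best predecessor for each state j
        delta[t] = cand.max(axis=0) + log_B[t]
    path = np.zeros(T, dtype=int)                 # termination + backtracking
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path, delta[-1].max()

log_pi = np.log(np.array([1.0, 1e-300]))
log_A = np.log(np.array([[0.6, 0.4], [1e-300, 1.0]]))
log_B = np.log(np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]))
path, log_p = viterbi_log(log_pi, log_A, log_B)
```

Working entirely with sums of logarithms replaces the products of the probability-domain recursion and avoids underflow on long utterances.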
(i) Score
The score results were obtained from the Viterbi algorithm. From the comparison of the test
utterance with each model, an output score result is produced for each comparison. Below is
the result of the output scores:
Figure 3.26 (a): Output score for the ayates ‘Maaliki yawmiddiini’
(ii) Log-Likelihood Ratio (LLR)
From the output scores obtained above, the maximum of these probability values needs to be
determined using the Log-Likelihood Ratio (LLR). The highest output score gained represents
the highest probability that an HMM model (comparison model) has produced the particular test
utterance, judged against the threshold value set. In this case, the resulting LLR confidence
score is 0.7253 × 10^3, which is above the threshold value set (> 0.2). The result is shown
in figure 3.26(b).
Figure 3.26 (b): Log-Likelihood Ratio (LLR) for the ayates ‘Maaliki
yawmiddiini’
3.3 Summary
This chapter has presented a brief technical overview of MFCC and HMM, and how the two
algorithms relate to each other. It was clearly stated that MFCC handles the feature
extraction process, which produces feature-vector outputs of the Quranic recitation. These
output values are used as the training set in the HMM, a pattern recognition technique that
classifies the different signals of the Quranic recitation.
The combination of MFCC and HMM has been widely used in speaker recognition, but applying
both algorithms (MFCC & HMM) to Quranic recitation is still considered a new approach. Thus,
this research studies the possibility of using this combination in an Automated Tajweed
Checking Rules Engine for Quranic Verse Recitation. Besides, this chapter has also described
the algorithms that underlie the training and recognition processes.
CHAPTER 4
4.1 Introduction
This chapter emphasizes the design and implementation of the Automated Tajweed checking rules
engine for Quranic verse recitation. It covers all aspects through various diagrams and parts,
which exhibit the logical and physical design of the system. A few diagrams are shown,
including the most relevant diagrams for Quranic verse recitation recognition based on a
speech recognition system, such as the context diagram, data flow diagram, flow chart and
other diagrams. Finally, this chapter also provides some snapshots of the Quranic verse
recitation recognition graphical user interface (GUI).
Figure 4.1: Automated Tajweed Checking Rules for Quranic verse recitation context
diagram
4.2 Overview of Automated Tajweed Checking Rules Engine
According to the research, the project mainly focuses on basic speech recognition technology,
but it is implemented for a different type of application and language, namely Quranic Arabic.
The different input content implemented in this engine would probably affect the percentage of
accuracy during the recognition process. So, the reliability and effectiveness of the system
also depend on the language and the system design created. The system is implemented using
the MATLAB programming environment.
4.2.1 Engine Development Part
The Engine Development part is responsible for the speech processing technology used to
extract, store and analyze the parameters of Al-Quran recitation. The Mel-Frequency Cepstral
Coefficient (MFCC) and Hidden Markov Model (HMM) based algorithms are currently selected for
feature extraction and classification (comparison). Here, the processes of speech recording
(speech sample collection), feature extraction, feature training and pattern recognition
formulate the Quranic verse recitation recognition methodology, which enhances the design of
the tajweed checking rules guidelines shown below. The architecture/block diagram of this
part is shown clearly in this chapter, while the remaining parts are described below.
For the Content Development part, the samples of Quranic recitation are recited by a certified
teacher (Mudarris) and those samples are stored on a PC for analysis purposes. A relevant GUI
was also developed, in order to provide a user-friendly Automated Tajweed Checking Rules
system. The Content Development part is responsible for all the content, including the
preparation of the Al-Quran contents, namely the Al-Quran transcript and the Al-Quran
recitation. The Al-Quran transcript is prepared in this part. Meanwhile, for the Al-Quran
recitation, currently each word of the first chapter of Al-Quran (Al-Fatihah) has been
carefully recited by a certified teacher (Mudarris) and stored on a Personal Computer (PC).
All the stored files (.wav) are sent to the Engine Development part for integration with the
speech processing technology. The Engine and Content Development parts eventually work
together to apply the speech recognition technology, in order to analyze both recitations
(teacher and student) based on the Rules of Tajweed. If a student recites Al-Quran
incorrectly, the system will show errors on the Graphical User Interface (GUI) and play back
the correct recitation.
The Quranic Arabic recitation is best described as a long, slow-paced, rhythmic, monotone
utterance (Essa, O., 1998; Nelson & Kristina, 1985). The sound of the Quranic recitation is
governed by tajweed, which is designed for clear and accurate presentation of the text. The
input of the system is the speech signal and the phonetic transcription of the speech
utterance. Thus, this project needs a speaker (input speech sample), feature extraction,
feature training and pattern classification/matching, which are the components important for
the recognition engine. The architecture of the Automated Tajweed checking rules for Quranic
verse recitation adheres to the Engine Development part mentioned earlier in part 4.2.1.
This part is divided into 3 main architectures: feature extraction, the training/testing
architecture and lastly the recognition architecture. Figure 4.1 shows the Automated Tajweed
Checking Rules for Quranic verse recitation context diagram, which represents the external
view of the system: the speaker performs a Quranic recitation via the Tajweed checking rules
engine and receives a response from the system after the input speech samples are processed,
with training/testing and recognition responding respectively after that. The schematic
Tajweed checking rules engine block diagram is shown in figure 4.3, and both the
training/testing and recognition architectures are shown in figure 4.4.
Figure 4.3: Block diagram schematic illustrating Tajweed checking rules engine
Figure 4.4: Tajweed checking rules engine architecture
Referring to figure 4.3, the Automated Tajweed Checking Rules engine block diagram, as well
as figure 4.4, the system architecture, shows the process flow of this research. The important
parts involved in this research can be described as follows. Figure 4.3 shows the overall
process of Quranic verse recitation recognition, represented in terms of a block diagram. In
this block diagram, 2 distinct phases are represented: the enrolment or training phase and
the matching/testing phase, as shown in figure 4.4. The training and matching/testing phases
are totally different processes. In the training phase, each recitor needs to provide samples
of Quranic recitation, so that the engine can build or train a reference model specifically
for that particular recitor. This means that in this part the researcher only needs to train
and store correct data of certain sourates of Quranic recitation in the database. In the case
of the speaker verification process, a specific threshold value can also be computed from the
training samples by the researcher. The aim of this action is to provide correct data, so as
to make the later matching decision reliable.
On the other hand, the input speech processed in the matching/testing phase is matched against
the stored reference model, and thus a decision can be made (recognition). The output data
from the Hidden Markov Model (HMM) must be compared against the database created during the
training process. Meanwhile, the system must act upon the feedback result and then give the
answer: either the output data matches the stored data in the database or not. If the output
data differs only slightly from the stored data in the database, the system will still assume
that the output data matches.
In this part, the data flow diagram shows the main processes performed by the Tajweed checking
rules engine. There are four main processes performing different tasks, as shown in figure
4.5. Those processes include receiving the Quranic recitation (speech samples), analyzing the
speech, searching and matching the speech, and returning the results.
The recitor acts as the source of the speech inputs and as the receiver of the results
produced by the Tajweed checking rules engine. The next process analyzes those speech inputs,
followed by searching and matching of the analyzed speech samples. Lastly, the Tajweed
checking rules engine produces and returns the matching results to the recitor. This system
will help and assist the recitors until the process is successfully executed.
Figure 4.5: Tajweed Checking Rules Engine Data Flow Diagram (DFD)
The Tajweed Checking Rules Engine flow chart emphasizes the system's flow of events. This
engine has 5 main stages, which include sampling, segmentation, feature extraction, training
and recognition.
Figure 4.6: Automated Tajweed checking rules engine for Quranic flow chart
Stage 1:
Referring to figure 4.6, sample input speech is recorded within the particular time frame.
The speech input for the above utterance is then segmented in order to differentiate the
speech regions from the non-speech regions. Non-speech regions are immediately detected and
removed.
Stage 2:
The segmented speech becomes the input to the phoneme segmentation module, where the
basic-level module extracts features from those speech signals; these are extensively used as
the feature vectors for the later stages.
Stage 3:
HMM classification (recognition) covers training and testing as well, and is mainly used for
the tajweed rules checking process. In the training part, a set of training speech is used
until the training process is successfully executed. The training process mainly includes
the tasks of the Content Development part mentioned earlier in part 4.2.2. Here, the recitors
need to train/repeat a set of words/phonemes or phrases of the Quranic recitation, and the
comparison algorithm is adjusted to match the initial training data set. Each word or phoneme
from the vocabulary is connected to a Hidden Markov Model, using the values obtained from HMM
modeling, such as A, mu and sigma. The values obtained from HMM modeling (A, mu and sigma)
serve as the stored reference model.
Each line of the Quranic recitation (represented as an array of input sample values) based on
ayat is arranged in sequence, line by line, in the MATLAB array editor. Based on the
arrangement of phonemes in that array editor, the values of (A, mu and sigma) are obtained
from HMM modeling (HMM training) for the line specified for each ayah in a certain sourate
(Al-Fatihah). The values for each line of the phonemes of an ayah are used as reference
patterns; the looping process over new inputs of Quranic recitation is executed line by line,
working alternately with the reference patterns until the looping process ends (based on the
line parameter set) and completes.
The values obtained at this stage are known as the recognition results. The results are
obtained after the real-time acquisition of the Quranic recitation, the speech processing
stage and the HMM modeling have executed. Then, the process continues with the recognition
procedure, where the values are compared with all codebook models (reference patterns) in
order to get the maximal likelihood ratio. Only the maximal values are taken as the
recognized output.
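The line-by-line matching loop can be pictured as follows; the scoring function here is a stand-in (a toy Gaussian-style log-likelihood with made-up parameters), since in the engine each score comes from the trained HMM (A_, mu_, sigma_) via the Viterbi pass:

```python
# Hypothetical sketch: compare one utterance against every stored codebook
# model and keep the one with the maximal likelihood.
def score(model, features):
    """Toy log-likelihood: higher when features sit near the model's mean."""
    return model["bias"] - 0.5 * sum((f - model["mean"]) ** 2 for f in features)

models = {
    "ayat1_model": {"mean": 0.0, "bias": -1.0},   # illustrative parameters only
    "ayat2_model": {"mean": 2.0, "bias": -1.0},
}
features = [1.9, 2.1, 2.0]                        # stand-in for an MFCC sequence

scores = {name: score(m, features) for name, m in models.items()}
best = max(scores, key=scores.get)                # model with maximal likelihood
```

The real engine repeats this comparison line by line over the reference patterns until the configured line parameter is exhausted.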
4.6 Tajweed Checking Rules Graphical User Interfaces
The automated Tajweed Checking Rules engine designed here is crucial to implement well. The
system designed must be flexible and user friendly, and also easy for the user to visualize.
Thus, the Graphical User Interface (GUI) is one way that can be used in this project's
development.
In this part, both the logical and physical aspects of the Tajweed checking rules engine are
presented and visualized using the Graphical User Interface (GUI). Besides, an understanding
of the Tajweed Checking Rules engine's functional requirements is also described.
Figure 4.7: Automated Tajweed Checking Rules Engine for Quranic verse Recitation
Graphical User Interface
Figure 4.8, shown below, lists the selected items for the particular ayat that have been
recorded before and need to be loaded into the engine for further processing and recognition.
Figure 4.8: Load the wave file of input speech sample from sourate Al-Fatihah
After the input speech sample has been selected and loaded into the system, the process
continues with the analyzing process for further processing. The analyzing part mainly
extracts the features from the sourate Al-Fatihah input speech sample, in order to obtain the
feature vector. The GUI visualization of this part can be seen in figure 4.9.
Figure 4.9: Analyzing process of sourate Al-Fatihah using MFCC (Started)
[Figure 4.11 content: waveform of the speech sample of the Quranic recitation (Amplitude vs.
Time [sec], 0 to 3.5 s) and its spectrogram (Frequency [kHz] vs. Time [sec])]
Figure 4.11: The input speech sample and spectrogram graph for ‘Bismillah’ utterance
Then, after the analyzing process is successfully completed, the process proceeds to matching
analysis; that is, in this part the Tajweed checking process is executed. If the Quranic
recitation is pronounced incorrectly, this engine notifies the user (recitor) of any false
word(s) in the ayah involved. The engine shows errors on the Graphical User Interface (GUI)
for any incorrect recitation of Al-Quran. Next, the engine guides the user (recitor) towards
the correct ayah to be followed or recited in order, playing back the correct recitation.
This description can be seen in the GUIs shown in figures 4.12, 4.13, 4.14 and 4.15.
Figure 4.12: The incorrect recitation of ‘Bismillah’ utterance (1st mistake/notification)
Figure 4.13: The incorrect recitation part involved and Tajweed rules
Figure 4.14: The incorrect recitation of ‘Bismillah’ utterance (2nd mistake/notification)
Figure 4.15: The incorrect recitation part involved and Tajweed rules
Figure 4.16: The correct recitation of ‘Bismillah’ utterance
On the other hand, if the Quranic recitation is correct, in terms of both its recitation and
its tajweed rules, the engine gives a result and shows the match for the ayah recited by the
user (recitor). This can be visualized in the figure shown below:
Figure 4.18: The correct recitation of ‘Arrahmaanirrahiim’ utterance
4.7 Summary
This chapter presented both the logical and physical aspects of the Automated Tajweed
Checking Rules for Quranic verse Recitation and provided a visualization of the main
graphical user interface of the engine. It also provided an understanding of the functional
requirements of the Automated Tajweed checking rules for Quranic verse recitation through
graphical representations.
CHAPTER 5
5.1 Introduction
In this chapter, the relevant experimental results are shown based on the findings of this
research, and discussed in detail following the system chronology described in the
methodology parts of chapters 3 and 4. The aim is to clearly show the experimental results
starting from the collection of speech samples, followed by feature extraction, then feature
training and lastly feature matching/testing. The last part, feature matching/testing, is the
main part that evaluates the performance of the engine.
In this section, the main concern is the collection of speech samples from 5 different
speakers (recitors) through a recording process. Each distinct word (ayah in sourate
Al-Fatihah) was recorded, and those speech samples were saved for further processing. The
speech samples collected comprised 52 words (ayat) and 82 probable phoneme samples for those
ayat across the different samples of Quranic recitation. These samples are used in the
training of the Hidden Markov Model (HMM) and also in the testing part. The speech samples
were recorded in a constrained environment, where the 5 selected speakers (recitors) chosen
were highly trained in Quranic recitation based on the ‘Tajweed rules’. Here, the first
chapter of Al-Quran (Al-Fatihah) was recited, each recording approximately 4 seconds in
length, in ‘.wav’ file format. Table 5.1 summarizes the collected speech samples.
[Table 5.1 excerpt]
Ayah (wave files): SiraathollazinaAn’amta’Alaihim ghayrillmaghdoobi’Alaihim waladdholeen
(fatihah6.wav & fatihah7.wav)
Word segments: Siratho, Allatheena, An’Aamta, ‘AAalayhim, Ghayri, Almaghdoobi, ‘AAalayhim,
Wala, Alddhalleena
For the phoneme templates, the sound file of each ayah of sourate Al-Fatihah was segmented
into individual files by cutting out only the desired part or specified region (Region of
Interest), using the GoldWave editor. The parameters of these inputs were set up identically,
in order to avoid any inconsistency in the resulting values. The summary of the phonemes
collected from the speech samples of the 8 ayat is listed in table 5.2.
Table 5.2: Summary of the Total Collected Speech Samples for each Ayates
5.3 Result of Feature Extraction
In chapter 3, the MFCC algorithm for feature extraction was presented. The feature extraction
process was applied to all 52 word and 82 phoneme speech samples of Quranic recitation
collected. Here, the MFCC cepstral coefficient values are obtained from each input speech
sample, and then transformed into the feature-vector output format. Based on the result, the
data ended up as 398 columns with 13 feature vectors (12 coefficients + 1 log energy).
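The 398 × 13 shape follows directly from the framing arithmetic. The sampling rate, frame length and hop size below are assumptions chosen for illustration (they are not stated in this passage), but they reproduce a 398-frame count for a 4 s recording:

```python
# Frame-count arithmetic behind a 398 x 13 MFCC feature matrix.
def num_frames(n_samples, frame_len, hop):
    """Number of full analysis frames that fit in the signal."""
    return 1 + (n_samples - frame_len) // hop

fs = 16000                       # assumed sampling rate [Hz]
duration = 4.0                   # ~4 s recitation per ayah, as in the text
frame_len = int(0.025 * fs)      # assumed 25 ms analysis frames
hop = int(0.010 * fs)            # assumed 10 ms hop between frames

frames = num_frames(int(duration * fs), frame_len, hop)
n_features = 12 + 1              # 12 cepstral coefficients + 1 log energy
```

Under these assumed settings the feature matrix for one utterance has `frames` rows and `n_features` columns, matching the 398 × 13 layout reported above.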
After the feature extraction process has executed, the recognition process compares the
extracted features with a reference model. This reference model is developed after the
enrolment or training phase has been successfully implemented. In this case, the reference
models (stored models in the database) consist of 2 types: the Word-based Model and the
Phoneme-based Model. The phoneme-based reference model differs totally from the word-based
model, in which the speech features that have been extracted are directly compared to the
word templates. Here, each word template in the direct-matching model is stored as a vector
of feature parameters. The word-based model was used as the first model, while the
phoneme-based model was the second model used for template matching in the testing/recognition
part. The phoneme-based model performs phoneme-like template matching, where the word
templates are stored as phoneme-like template parameters. The phoneme-based model will be
discussed in detail in the Tajweed checking rules part.
Here, the experimental work of performing the Hidden Markov Model (HMM) feature training has
been completely done. The feature vectors produced by the Mel-Frequency Cepstral Coefficient
(MFCC) are combined in order to create the database that serves as the HMM model, specifically
used to provide the template matching while training the data. The feature vectors of each
distinct word are combined together in order to create a database for that particular
distinct word, where the values of (A, pi0, mu, sigma) are evaluated and stored in the
database. Table 5.3 shows the result of creating the HMM models for the particular recitations
of Al-Quran during the enrolment or training phase, in (.mat) format. Each distinct word in
the dictionary was trained against the initial template of the model.
Table 5.3: Template Data of HMM Model for Collected Quranic Recitations
Columns: word in the dictionary | wave file assigned | HMM model (word/ayah-like template) |
HMM model (phoneme-like template)
Row: Iyyakana’Abudu waiyyakanastaeen | fatihah4.wav | ayat4_model.mat
Row: SiraathollazinaAn’amta’Alaihim ghayrillmaghdoobi’Alaihim waladdholeen | fatihah6.wav &
fatihah7.wav | ayat6_model.mat
As mentioned earlier, the system contains 2 separate templates of the HMM model from the
training corpus. The first model stands for the word (ayah) templates, while the second is
for the phoneme-like templates. The training corpus used 2 tests to compose the samples of
Quranic recitation; those tests can be seen in part 5.5 and will be discussed later. From the
corpus, 82 phoneme-like template samples of Quranic recitation were produced and converted
into phoneme strings using the Quranic pronunciation rules. Those templates were taken from
the 8 words (ayat) of sourate Al-Fatihah; the templates were then manually arranged into 7
model files and stored in the database as HMM models (.mat), as shown in table 5.3. These
models not only recognize the phonemes but also check the tajweed rules that govern the
recitation of Al-Quran. For each experiment executed, both the training and word templates
uttered are from the same speaker (recitor).
5.4.1 Tajweed Checking Rules Database
The engine database covers the 8 ayat with 52 word utterances, together with another 28
phonemes from those ayat with 82 samples of input phonemes. This engine scans the input Holy
Quran Ottoman sound and text, searching for symbols and features, and generates for each its
code and pronunciation status, as well as its articulation, nasalization and aspiration
attributes. Then, the engine analyzes those codes according to the Quranic recitation rules
and their exceptions. The HMM enrolment/training part gathers all the information in order to
develop phoneme-based patterns; these patterns are used for matching against the
pronunciation-variant rules during the matching and testing process. Here, the engine
database contains 10 rules covering pronunciation errors in the Quranic recitation, and those
rules represent the ways in which these recitation errors are detected, as listed below:
[Table excerpt: Tajweed rules and their trigger conditions; Arabic script omitted]
- Idgham Syamsi: alif lam meets ra; alif lam meets dal; alif lam meets syad; alif lam meets zai
- Mad ‘Arid Lissukun: a letter of mad followed by Waqf (stop)
- Izhar Syafawi: mim sukoon meets dal; mim sukoon meets ta’; mim sukoon meets ghim; mim sukoon
  meets wau
- Izhar Qamari: alif lam meets ‘ain
- Izhar Halqi: nun sukoon meets ‘ain
Besides the tajweed rules listed above, 4 other additional Ahkam al-Tajweed are also checked,
namely Iqlab, Idgham Bila Ghunna, Idgham Ma’al Ghunna and Ikhfa’ Haqiqi.
The experimental results of performing the MFCC algorithm for feature extraction from the
Quranic recitation speech samples, and then matching/testing against the trained HMM (Hidden
Markov Model) data templates using the same HMM classification method, are presented here. As
mentioned earlier in part 5.4, those data templates also consist of 2 types (the phoneme-like
template and the word (ayah) template). Both templates were used as reference models (template
matching), purposely for the recognition task. In this task, any input passing through this
engine is compared with the stored templates, and the template that most closely matches the
input is taken as the recognition result.
The automated Tajweed checking rules engine acts upon any Quranic recitation whenever it
receives an input speech signal, because any speech passing through the system gives an
output score and causes the engine to make judgements. Thus, the score value measuring the
confidence of a recognized word needs to be found. Besides, the ayat and phonemes have been
classified under 2 different categories, either In-Vocabulary (IV) data or Out-of-Vocabulary
(OOV) data, in order to ensure that the engine is capable of checking the tajweed rules. The
basic idea for separating IV and OOV phonemes/words is that the likelihood difference between
the best and 2nd-best results for OOV input is smaller than that for IV input, because no
model matches the OOV input well. As mentioned earlier in chapter 3, the standard
Log-Likelihood Ratio (LLR) and the augmented LLR are used:

LLR = [log P(best | O) − log P(2nd best | O)] / N   (1)

where N is the length of the input utterance, log P(best | O) is the largest log-likelihood
and log P(2nd best | O) is the second largest log-likelihood.
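A minimal sketch of the confidence computation in equation (1); the per-model log scores here are hypothetical, since in the engine they come from the Viterbi scoring of each trained model:

```python
def llr_confidence(log_scores, n):
    """LLR = (largest log-likelihood - second largest) / N, per Eq. (1)."""
    ordered = sorted(log_scores, reverse=True)
    return (ordered[0] - ordered[1]) / n

# hypothetical output scores of three trained models for one test utterance
scores = {"ayat1_model": -420.0, "ayat2_model": -510.5, "ayat3_model": -498.2}
llr = llr_confidence(scores.values(), n=100)   # N = utterance length in frames
best = max(scores, key=scores.get)             # recognized model
```

A large LLR means the best model clearly outscores its nearest competitor, which is the signature of an In-Vocabulary input.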
According to Yongwon, J. et al. (2001), the log-likelihood of the word itself is not
appropriate as a measurement for setting the threshold value; thus, equation (1) is used in
their research. In this case, if the input to equation (1) for an IV word is changed a
little, the recognition result obtained would not change much, due to the relatively large
likelihood difference between the best and 2nd-best results. But in the case of an OOV
word/phoneme, it is highly probable that the result for the changed input differs from that
of the original input. Because of that, a perturbed input is employed in order to improve the
robustness of the confidence score. Here, several perturbation methods have been used:
coef1 = k1 * coef;
coef2 = coef - k2 * mc;
coef3 = coef - k3 * σc;
In the above formulas, coef is the feature vector, mc is the mean vector of the feature vectors of the input speech, and σc is the standard-deviation vector of the feature vectors of the input speech. On the other hand, k1, k2 and k3 are constant values which need to be adjusted so that the divergence between the recognition results of the original and perturbed feature vectors remains below 10%, especially for IV words/phonemes. After the features (e.g. coef2) have been perturbed, if the recognition result remains unchanged, a constant value k is added to the LLR:

LLRA = LLR + k, if Wo = Wp
LLRA = LLR,     if Wo ≠ Wp

where Wo is the recognized word from the original input feature vector, and Wp is the recognized word from the perturbed input feature vector. Here, the threshold value for the LLRA is set by training on IV and OOV inputs, and the LLRA result is obtained after
the testing process has successfully executed. If the value of LLRA > threshold, the input is considered an IV word; otherwise, it is considered an OOV word/phoneme. This threshold setting can be changed, depending on the MATLAB program developed. After the implementation of LLRA had been carried out, the results obtained were not perfect, especially for the recognition of the Tajweed checking rules, which is performed in terms of phonemes. LLRA was presented and used before by Yongwon, J. et al. (2001), but for single, direct word recognition.
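The perturbation formulas and the LLRA rule described above can be sketched as follows, in Python rather than the thesis's MATLAB. The constants k1, k2, k3 and the bonus k are illustrative placeholders, not the tuned values used in the experiments.

```python
# Sketch of the input-perturbation step and the augmented LLR (LLRA).
# In the thesis, k1..k3 are tuned so that IV recognition results diverge
# by less than 10% under perturbation; the defaults here are placeholders.

def perturb(coef, mc, sigma_c, k1=1.01, k2=0.05, k3=0.05):
    """Return the three perturbed copies of a feature vector:
    coef1 = k1*coef, coef2 = coef - k2*mc, coef3 = coef - k3*sigma_c."""
    coef1 = [k1 * c for c in coef]
    coef2 = [c - k2 * m for c, m in zip(coef, mc)]
    coef3 = [c - k3 * s for c, s in zip(coef, sigma_c)]
    return coef1, coef2, coef3

def augmented_llr(llr_value, word_original, word_perturbed, k=100.0):
    """LLRA: add a bonus k only when the recognized word is unchanged by
    the perturbation (Wo = Wp), i.e. the input behaves like an IV word."""
    return llr_value + k if word_original == word_perturbed else llr_value

print(augmented_llr(-1000.0, "fatihah2", "fatihah2"))  # -900.0
print(augmented_llr(-1000.0, "fatihah2", "fatihah5"))  # -1000.0
```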
In this case, LLRA is not suitable for phoneme input, because the LLR values for IV phonemes and OOV phonemes are almost the same. Thus, another alternative to the LLR has been adopted: the difference ratio between the best and 2nd-best log-likelihoods,

Diff_ratio = [log P(best|O) - log P(2nd best|O)] / log P(best|O)     (2)
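Equation (2) can be sketched as below (Python, not the original MATLAB). Since the log-likelihoods are negative, the sign of the ratio depends on the implementation; this sketch therefore compares the magnitude of the relative gap against the 0.2 threshold reported for the word-level test, and the score values are illustrative.

```python
# Difference ratio of equation (2): gap between the best and 2nd-best
# log-likelihoods, normalized by the best log-likelihood.

def diff_ratio(log_p_best, log_p_2nd_best):
    """Compute equation (2) from the two top-ranked log-likelihoods."""
    return (log_p_best - log_p_2nd_best) / log_p_best

def is_in_vocabulary(log_p_best, log_p_2nd_best, threshold=0.2):
    """IV if the relative gap is large; for OOV input all templates score
    similarly, so the gap stays below the threshold. abs() is used because
    log-likelihoods are negative (a sign-convention assumption)."""
    return abs(diff_ratio(log_p_best, log_p_2nd_best)) > threshold

print(is_in_vocabulary(-568.5, -813.7))   # True  (clear best match -> IV)
print(is_in_vocabulary(-940.0, -950.0))   # False (no clear winner -> OOV)
```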
Two tests were performed on this system in order to evaluate its performance. As mentioned earlier in part 5.4, in every experiment both the training and the word templates were uttered by the same speaker. Table 5.6 and table 5.9 show the overall results.
In this part, the LLR threshold value is -1100, with a difference-ratio threshold of 0.2. If the value of LLRA > -1100, the input is considered an IV word; if LLRA < -1100, it is considered an OOV word. Moreover, the results obtained from equation (2) give difference-ratio values mostly greater than 0.2 for IV input, while most OOV inputs give values less than 0.2. This can be seen in table 5.5 below, where the values highlighted in red are those above 0.2 (IV words), while the values highlighted in blue represent the OOV words (values less than 0.2). This means that all 8 ayates of sourate Al-Fatihah shown below were categorized as IV words. In the application of this engine, whenever an input is claimed to be an OOV word/ayat, a notification is made with reference to the stored templates, for evaluation purposes; whenever an IV input is identified as IV, the recitation is accepted.
Table 5.5: Result of Likelihood Ratio (LLR) for 8 recitations of speech samples (1.0 x 10^3)

Sequence     x1       x2       x3       x4       x5       x6       x7       x8
MLM          1        3        6        7        8        5        2        4
MLM          2        7        8        5        6        1        4        3
MLM          3        1        6        7        2        8        4        5
MLM          4        6        1        5        7        3        8        2
MLM          5        6        7        1        8        3        4        2
logP(X|Θ6)   0.2667   -4.4097  -4.8590  -4.8904  -4.9843  -5.3690  -5.8303  -7.7457
MLM          6        7        5        8        1        3        4        2
MLM          7        6        8        1        5        3        4        2
MLM          8        1        7        5        6        4        3        2
Table 5.6: Test result for 8 recitations of speech samples (ayates of sourate Al-Fatihah)

Ayates/Articulation   # of utterances   Correct   Wrong   % Accuracy   % WER
[Arabic ayat]         5                 5         0       100          0
[Arabic ayat]         5                 5         0       100          0
[Arabic ayat]         7                 7         0       100          0
[Arabic ayat]         6                 6         0       100          0
[Arabic ayat]         9                 8         1       88.89        11.1
[Arabic ayat]         9                 9         0       100          0
[Arabic ayat]         6                 4         2       66.67        33.33
[Arabic ayat]         5                 4         1       80           20
For the first test, 8 ayates of sourate Al-Fatihah were tested, and the results are shown in table 5.6 above. In this experiment, the extracted features of the 8 ayates of the Quranic recitation were directly compared to the word templates (word-based model). As a result, accuracy on the training data reached 91.95%, with only 4 errors and a Word Error Rate (WER) of 8.05%. This is better than the results of previous research carried out by Ehab, M. et al. (2007) and Anwar, M.J. et al. (2006), whose recognition accuracy rates were 85% and 89% respectively.
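The per-row accuracy and WER figures tabulated in tables 5.6 and 5.9 follow directly from the utterance counts. A minimal sketch, with one illustrative row of counts rather than a claim about the thesis data:

```python
# Percentage accuracy and word error rate from per-ayat utterance counts.

def accuracy_and_wer(correct, total):
    """Return (% accuracy, % WER) for `correct` out of `total` utterances."""
    accuracy = 100.0 * correct / total
    return accuracy, 100.0 - accuracy

acc, wer = accuracy_and_wer(8, 9)     # e.g. 8 of 9 utterances recognized
print(round(acc, 2), round(wer, 2))   # 88.89 11.11
```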
The second test was performed on the phoneme-based templates, in order to check the Tajweed rules for the particular ayates of the given Quranic recitation. Note that the threshold value for the phoneme-template experiment is -500, with a difference-ratio value of 0.01. However, the threshold polarity is the opposite of the previous test: if the value of LLRA > -500, the input is considered OOV. Once an utterance has been detected as an OOV phoneme, the identification and verification of pronunciation-rule errors (Tajweed rules) is executed, meaning that the pronunciation of the particular Quranic recitation is detected as false/incorrect. Tables 5.7 and 5.8 below show the experimental results for two sample phonemes of “Bismillahir rahmaanir rahiimi”, for better understanding.
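The phoneme-level decision just described can be sketched as follows (a Python sketch under the reversed threshold polarity stated above; the verdict strings are illustrative, not the engine's actual messages):

```python
# Phoneme-level tajweed decision: for the phoneme templates, scores ABOVE
# the -500 threshold (values on the 1.0e+003 scale of tables 5.7-5.8) are
# treated as OOV and flag a pronunciation (tajweed) error.

def check_tajweed(llra_score, threshold=-500.0):
    """Return a verdict for one phoneme based on its LLRA score."""
    if llra_score > threshold:
        return "tajweed error (OOV phoneme)"
    return "correct (IV phoneme)"

print(check_tajweed(-568.5))  # correct (IV phoneme)
print(check_tajweed(54.4))    # tajweed error (OOV phoneme)
```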
Table 5.7: Comparison between correct and incorrect Tajweed rules for ayates “Bismillahir
<rahmaanir> rahimi”
                          Correct Recitation              Incorrect Recitation
Ayates
(The utterances/          Bismillahir RAHMAANIR rahimi    Bismillahir RAHMUUNIR rahimi
Articulation)
Score: Score:
Output Score 1.0e+003 * 1.0e+003 *
Columns 1 through 3 Columns 1 through 3
-1.5521 -1.3030 -2.1808 -1.2703 -0.9670 -2.2708
Columns 4 through 6 Columns 4 through 6
-2.2018 -0.7968 -1.1091 -1.4974 -0.8738 -0.9279
Columns 7 through 9 Columns 7 through 9
-0.6398 -0.6541 -0.5685 -0.5621 -0.7777 -0.8362
Columns 10 through 12 Columns 10 through 12
-0.8995 -1.0463 -0.8684 -0.7123 -0.9958 -0.7422
Columns 13 through 15 Columns 13 through 15
-1.1624 -0.6604 -1.0446 -0.9294 -0.4929 -1.0265
Columns 16 through 17 Columns 16 through 17
-0.6033 -0.7845 0.0544 -0.9155
LLR: LLR:
Log-likelihood 1.0e+003 * 1.0e+003 *
(LLR) Columns 1 through 3 Columns 1 through 3
-0.5685 -0.6033 -0.6398 0.0544 -0.4929 -0.5621
Columns 4 through 6 Columns 4 through 6
-0.6541 -0.6604 -0.7845 -0.7123 -0.7422 -0.7777
Columns 7 through 9 Columns 7 through 9
-0.7968 -0.8684 -0.8995 -0.8362 -0.8738 -0.9155
Columns 10 through 12 Columns 10 through 12
-1.0446 -1.0463 -1.1091 -0.9279 -0.9294 -0.9670
Columns 13 through 15 Columns 13 through 15
-1.1624 -1.3030 -1.5521 -0.9958 -1.0265 -1.2703
Columns 16 through 17 Columns 16 through 17
-2.1808 -2.2018 -1.4974 -2.2708
Tajweed
Rules - Mad Asli Mutlak
Table 5.8: Comparison between correct and incorrect Tajweed rules for ayates “Bismillahir
rahmaanir <rahiimi>”
                          Correct Recitation              Incorrect Recitation
Ayates
(The utterances/          Bismillahir rahmaanir RAHIIMI   Bismillahir rahmaanir RAHUUMI
Articulation)
Score: Score:
Output Score 1.0e+003 * 1.0e+003 *
Columns 1 through 3 Columns 1 through 3
-2.0779 -1.6710 -1.9139 -1.8007 -1.6138 -2.5081
Columns 4 through 6 Columns 4 through 6
-2.0321 -1.2066 -1.1630 -2.7402 -0.9999 -1.0721
Columns 7 through 9 Columns 7 through 9
-1.1592 -1.2839 -0.8137 -0.6768 -0.7334 0.0782
Columns 10 through 12 Columns 10 through 12
-1.5029 -1.6649 -1.5198 -1.1591 -1.3392 -1.0342
Columns 13 through 15 Columns 13 through 15
-1.6956 -0.8598 -1.4082 -1.4091 -0.7923 -1.4355
Columns 16 through 17 Columns 16 through 17
-1.1358 -1.7441 -0.6912 -0.9625
LLR: LLR:
Log-likelihood 1.0e+003 * 1.0e+003 *
(LLR) Columns 1 through 3 Columns 1 through 3
-0.8137 -0.8598 -1.1358 0.0782 -0.6768 -0.6912
Columns 4 through 6 Columns 4 through 6
-1.1592 -1.1630 -1.2066 -0.7334 -0.7923 -0.9625
Columns 7 through 9 Columns 7 through 9
-1.2839 -1.4082 -1.5029 -0.9999 -1.0342 -1.0721
Columns 10 through 12 Columns 10 through 12
-1.5198 -1.6649 -1.6710 -1.1591 -1.3392 -1.4091
Columns 13 through 15 Columns 13 through 15
-1.6956 -1.7441 -1.9139 -1.4355 -1.6138 -1.8007
Columns 16 through 17 Columns 16 through 17
-1.7441 -1.9139 -2.5081 -2.7402
According to the LLR results in tables 5.7 and 5.8 above, the values highlighted in red represent the IV phonemes, while the values highlighted in blue represent the OOV phonemes. In this case, two different phonemes from the ayat “Bismillahir rahmaanir rahiimi” were tested. The best results obtained for the two correct recitations are -0.5685 and -0.8137 (x 10^3), which lie below the LLR threshold (LLR < -500) and were classified as IV phonemes (correct recitation). On the other hand, the LLR values highlighted in blue (0.0544 and 0.0782) were categorized as OOV phonemes (incorrect recitation), since these values lie above the LLR threshold (LLR > -500). For the first phoneme, the incorrect recitation violates the tajweed rule ‘Mad Asli Mutlak’, where the letter of mad must be recited with 2 harakat. Besides that, the pronunciation of the 2nd phoneme was also detected as false with respect to the tajweed rules, namely Mad ‘Arid Lissukun (the letter of mad at a Waqf (stop)), since that phoneme needs to be pronounced with the proper length.
The sample-phoneme results shown in tables 5.7 and 5.8 were purposely used to check the Tajweed rules in this sourate. In the experiment summarized in table 5.9 below, the feature vectors of the input phonemes matched the phoneme-based templates with an accuracy of 86.41%, and an error rate of only 14.34%. Although the accuracy in this experiment is somewhat smaller than the previous result in table 5.6, the result is still within expectation, because this experiment involved a larger number of samples, particularly for testing purposes. From the experiment, the current method is also much simpler than LLRA, as it only needs to compute the ratio of likelihood differences, without perturbing the input.
Table 5.9: Test result for 28 recitations of speech samples (Phonemes)

Ayates          Phonemes                      # of utterances   Correct   Wrong   % Accuracy   % WER
Bismillah.wav   Bismi, Llahii,                17                16        1       94.12        5.88
                Rraohimani, Rraohiiim
fatihah1.wav    Allhamdu, Lillahhirabbil,     9                 8         1       88.89        11.1
                A’alamiinna
fatihah2.wav    Alrrahmani, Alrraheemi        6                 6         0       100          0
fatihah3.wav    Maaliki, Yawmi, Alddeeni      8                 8         0       100          0
fatihah4.wav    Iyyaka, naA’Abudu,            11                8         3       72.72        33.3
                waiyyaka, nastaAAeenu
fatihah5.wav    Ihdina, Alssiratho,           9                 8         1       88.89        11.11
                Almustaqeema
fatihah6.wav    Siratho, Allatheena,          12                8         4       66.67        33.33
                An’Aamta, ‘AAalayhim
fatihah7.wav    Ghayri, Almaghdoobi,          10                8         2       80           20
                ‘AAalayhim, Wala,
                Alddhalleena
Total           28 phonemes                   82                70        12      86.41        14.34
Figures 5.1 and 5.2 below are bar charts of the percentage of accuracy and of the Word Error Rate (WER) for ayates and phonemes, summarizing the overall results of the previous experiments.

Figure 5.1: Percentage of accuracy for recognition rate (Ayates & Phonemes)

Figure 5.2: Percentage of Word Error Rate (WER) for ayates & phonemes

Based on figure 5.1, two ayates achieved 100% accuracy for both ayates and phonemes: fatihah 2 and fatihah 3 (ayates 2 & 3). Hence, the WER for these ayates remains at 0%, without any error detected, as can be seen in the bar chart in figure 5.2. This is probably because these ayates are short sentences and phonemes, avoiding much of the complexity of the matching and recognition process. Meanwhile, fatihah 6 (ayat 6) achieved the lowest accuracy: its ayat and phonemes reached only 66.67%, while its WER was the highest, at 33.33%. The rationale behind this result is probably the complexity of pronouncing this ayat, as well as the difficulty of matching and recognizing the exact utterance properly.
5.6 Summary

The overall process conducted in this research is shown clearly in chapter 4, through the Data Flow Diagram (DFD) and the flowchart of the Tajweed Checking Rules engine. Based on this DFD and flowchart, the processes involved in this research can be clearly seen and justified.

The experimental results obtained fulfilled the targeted criteria and goals set earlier, although there were some limitations and unexpected difficulties along the way.
CHAPTER 6

6.1 Introduction

In this chapter, the process involved in this project is discussed briefly, and recommendations for the future enhancement of the overall research are presented, in order to make this system more efficient and sophisticated. It also highlights the significance and contributions of this research. In addition, it elaborates on the weaknesses and strengths of this research, as well as propositions for improvement and future works. Lastly, at the end of this chapter, the entire research is concluded.
6.2 Significance and Contributions of Tajweed Checking Rules engine for Quranic verse recitation

(i) Provides an alternative way to learn Al-Quran recitation, towards creating a knowledgeable society.

(ii) Facilitates students in reading Al-Quran at their own pace and time. This engine can also serve as a self-learning tool for working adults with time constraints who wish to learn Al-Quran.

(iii) Checks the tajweed rules of a recitation against a stored database.

(iv) Builds better skills and understanding of Quranic reading in a faster way.

(v) Promotes Quranic literacy and explores new approaches in signal processing technology.

(vi) Supports the Quranic learning process, especially in the j-QAF educational programme, a complementary school programme, by utilizing current ICT developments.

(vii) Encourages Muslims to advance their recitations, and helps newly converted Muslims and students to learn and practice Islam in a more convenient and effective way.
Different observers or researchers have different opinions and views when testing and evaluating this system. That is normal, since the system invented is still new.

6.3 Strengths and Weaknesses

6.3.1 Strengths

The engine developed and implemented was able to achieve the targeted objectives, with its own strengths. The strengths of this research are listed below:
(i) In this modern and technological era, a speech interaction system is believed to be able to achieve users' targeted objectives in a very easy and fast manner. Besides, an interactive speech recognition system will ease and speed up the communication process.

(ii) The automated Tajweed Checking Rules engine for Quranic verse recitation enables the user to recite Al-Quran through the MATLAB Graphical User Interface (GUI), hear the correct recitation and hence determine the proper way to recite it.

(iii) This interactive engine is a self-learning educational tool that can support students in j-QAF learning, especially in learning Al-Quran (Tasmik & Khatam al-Quran model). Besides, this engine is also able to ease the workload of j-QAF teachers.

(iv) This project gives more researchers a chance to get involved in work done by University of Malaya students, since students may refer to this project for their own benefit when developing systems of the same nature.

(v) Allows the interchange of ideas and collaboration between two or more faculties.

(vi) The engine developed shows promising results, in which almost the exact match was obtained.

(vii) This research shows that the combination of MFCC feature extraction and HMM classification is effective for Quranic verse recognition.

(viii) The most challenging task in this research was to implement Al-Quran with a speech recognition system, together with the engine's capability of checking the tajweed rules. Nevertheless, this engine was able to achieve recognition rates of 91.95% (ayates) and 86.41% (phonemes), which indicates that the engine was successful.
6.3.2 Weaknesses

Throughout these years, the research conducted also faced some problems and difficulties, due to some limitations and weaknesses in the speech recognition research area.

(i) The implementation of Quranic recitation with a speech recognition system is not an easy job, since the technology is still new in the market. The software and hardware required might be costly and hard to obtain.

(ii) Most past research was executed and implemented for the English language only.

(iii) Speaker recognition is a difficult task. It is very hard to get an exact match with a high accuracy rate in many cases, especially during the training and testing sessions, because conditions can differ greatly: the human voice changes with time and health conditions (e.g. when the speaker has a cold), among other factors.
6.4 Future Works

According to the research conducted, the engine developed showed promising results, although it was only tested against a small Quranic chapter (i.e. sourate Al-Fatihah). It is still at an early stage of research, and needs proper attention and improvement to make it more capable and useful to the end users. The Quranic implementation in speech recognition systems, especially in checking the Tajweed rules, will always open new developments in this technology, allowing more researchers and creativity to get involved. Many things need to be considered to improve the system further in the future. Below are the proposed tasks for future enhancement:

(i) The engine shall be able to accept more test cases from various users' Quranic recitation inputs. Here, the engine must support multiple users, accepting any voice regardless of speaker.

(ii) The engine shall be integrated with hardware, allowing users to use the engine in a real system (portable device) rather than a simulation. However, the integration process could be costly and very time consuming.
6.5 Conclusion

This research has covered many aspects of speech recognition systems, and its findings will be highly beneficial for learning Al-Quran in a more interesting manner, while complying with the established Islamic ways and rules. For recognition purposes, the recitor's recitation score was evaluated against the database system. In addition, this research has successfully achieved its objectives and, hopefully, will give a lot of benefit to the end users for whom it is designed. The automated Tajweed Checking Rules engine for Quranic verse recitation showed both strengths and weaknesses after it was successfully developed. The achievements of this engine are valuable indeed, as it will be a reference for other researchers and developers of such systems in the future. It is very much hoped that the engine will be implemented in real life and integrated with a hardware system.
REFERENCES
Ahmad, A.M., Ismail, S., Samaon, D.F., 2004, 'Recurrent Neural Network with
Backpropagation through Time for Speech Recognition,' IEEE International
Symposium on Communications & Information Technology, 2004. ISCIT ‘04. Volume 1,
pp. 98 – 102.
Ahmed, M.E., 1991, ‘Toward an Arabic Text-To-Speech System’, The Arabian Journal for Science and Engineering, 1991.
Anwar, M.J., Awais, M.M., Masud, S. & Shamail, S.,” Automatic Arabic Speech
Segmentation System.” Department of Computer Science, Lahore University of
Management Sciences, Lahore, Pakistan.
Bashir, M.S., Rasheed, S.F., Awais, M.M., Masud, S., & Shamail, S., 2003,'Simulation of
Arabic Phoneme Identification through Spectrographic Analysis,' Department of
Computer Science, University of Engineering & Technology, Lahore Pakistan, Lahore
Pakistan.
Bateman, D., Bye, D. & Hunt, M., 1992, ‘Spectral Contrast Normalization and Other Techniques for Speech Recognition in Noise’, Proc. IEEE Inter. Conf. Acoustic, Speech Signal Process, vol. 1, pp. 241-244, 1992.
Chetouani, M., Gas, B., Zarader, J.L. & Chavy, C., 2002, ‘Neural Predictive Coding for
speech Discriminant Feature Extraction: The DFE-NPC’, ESANN’2002 Proceedings
– European Symposium on Artificial Neural Network, Bruges, Belgium, pp. 275-280.
Ehab, M., Ahmad, S. and Mousa, A. 2007,'Speaker Independent Quranic Recognizer Based
on Maximum Likelihood Linear Regression,' Proceedings of World Academy of
Science, Engineering and Technology Volume 20 April 2007.
Essa, O., 1998, ‘Using Prosody in Automatic Segmentation of Speech’, Proceeding 36th
ACM Southeast Regional Conference, pp. 44 - 49, April 1998.
Essa, O.,”Using Suprasegmentals in Training Hidden Markov Models for
Arabic."Computer Science Department, University of South Carolina, Columbia.
Habash, M., 1986, “How to memorize the Quran”, Dar al-Khayr, Beirut 1986.
Hansen, J.C., 2003, ‘Modulation based parameter for Automatic Speech Recognition’,
Master Thesis of Department of Electrical Engineering, University of Rhode Island,
USA.
Hasan, M.R., Jamil, M., Rabbani, M.G. & Rahman, M.S., 2004, ‘Speaker Identification
Using Mel Frequency Cepstral Coefficients’, 3rd International Conference on
Electrical & Computer Engineering ICECE 2004, 28-30 December 2004, Dhaka,
Bangladesh ISBN 984-32-1804-4 565.
Hemantha, G.K., Ravishankar, M., Nagabushan, P. & Basavaraj, S.A., 2006, ‘Hidden
Markov Model based approach for generation of Pitman shorthand language symbols
for consonants and vowels from spoken English’, Sadhana – June 2006. Vol. 31, part
3, pp. 227-290.
Hermansky, H., 1990, ‘Perceptual linear predictive (PLP) analysis of speech’, The Journal
of the Acoustical Society of America -April 1990. Volume 87, Issue 4, pp. 1738-1752.
Hosom, J.P., Cole, R. and Fanty, M. 1999, Speech Recognition Using Neural Networks at
the Center for Spoken Language Understanding, Center for Spoken Language
Understanding (CSLU) Oregon Graduate Institute of Science and Technology, July 6,
1999.
Huang, X., Acero, A., & Hon, H.W., 2001, Spoken Language Processing: A Guide to
Theory, Algorithm and System Development, Prentice Hall, Upper Saddle River, NJ,
USA.
Institute for Research in Islamic education (Newspaper), 2007, The New Strait Times
Press-26 September 2007 [Online] Available at: http://www.nst.com.my/ retrieved on
20 November 2007.
J. de Veth and L. Boves, 1998, ‘Channel normalization techniques for automatic speech
recognition over the telephone’. Speech Communication 25 (1998) 149-164.
Jurafsky, D. & Martin, J.H., 2007, Automatic Speech Recognition: Speech and Language
Processing: An Introduction to natural language processing, computational linguistics,
and speech recognition, Prentice Hall, New Jersey, USA.
Khalifa, O., Khan, S., Islam, M.R., Faizal, M. & Dol, D., 2004, ‘Text Independent
Automatic Speaker Recognition’, 3rd International Conference on Electrical &
Computer Engineering, Dhaka, Bangladesh, pp.561-564.
Kirchhoff, K., Bilmes, J., Das, S., Duta, N., Egan, M., Ji, G., He, F., Henderson, J., Liu, D., Noamany, M., Schone, P., Schwartz, R. & Vergyri, D., 2003, ‘Novel approaches to Arabic speech recognition: report from the 2002 Johns-Hopkins Summer Workshop’, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP ’03), Volume 1, 6-10 April 2003, pp. I-344 - I-347.
Kirchhoff, K., Vergyri, D., Bilmes, J., Duh,K. and Stolcke, A. 2004,'Morphology-based
language modeling for conversational Arabic speech recognition,' Eighth
International Conference on Spoken Language ISCA, 2004.
Lee, K.F. & Hon, H.W., 1989, ‘Speaker-Independent Phone Recognition Using Hidden
Markov Models’, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol.
31, pp.1641-1648.
Levent, M.A., 1996, “Foreign Accent Classification in American English.” Dissertation for
Doctor of Philosophy in Department of Electrical & Computer Engineering, Graduate
School of Duke University, Durham, USA.
Linde, Y., Buzo, L. & Gray, R.M., 1980,” An algorithm for Vector Quantizer Design”,
IEEE Transactions on Communications, Vol.COM28,no 1,pp.84-95.
Madisetti, V.K. & Williams, D.B., 1999, Digital Signal Processing Handbook,
CRCnetBASE. CRC Press LLC, USA.
Martens, J.P., 2002, 'Continuous Speech Recognition over the Telephone', Electronics &
Information Systems, Ghent University, Belgium. Available at:
http://trappist.elis.ugent.be/ELISgroups/speech/cost249/report/intro.pdf retrieved on 10
September 2008.
Matsui, T., & Furui, S., 1993, 'Comparison of text-independent speaker recognition
methods using VQ-distortion and discrete/continuous HMMs'. Proceedings of the 1993
International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
Institute of Electrical and Electronic Engineers. Minneapolis, Minnesota, pp.157 – 160.
Maamouri, M., Bies, A. and Kulick, S., 2006,'Diacritization to Arabic Treebank Annotation
and Parsing,' Proceedings of the Conference of the Machine Translation SIG, 2006.
Nathan, K., Beigi, H.S.M. and Subrahmonia, J., 1995, ‘On-line Unconstrained Handwriting
Recognition Based On Probabilistic Techniques’.
Nelson, K., 1985, ‘The Art of Reciting the Quran’, University of Texas Press, 1985.
Owen, F.J., 1993, ‘Signal Processing of Speech’. Macmillan Press Ltd., London, UK.
Prime Minister's Office of Malaysia 2006, Ninth Malaysia Plan 2006 – 2010, Chapter 11:
Enhancing Human Capital. Available at: http://www.epu.jpm.my/rm9/english/
Chapter11.pdf retrieved on 18 November 2007.
Program j-QAF sentiasa dipantau (Newspaper), 2005, Berita Harian Press-10 May 2005
[Online] Available at: http://www.bharian.com.my/ retrieved on 18 November 2007.
Penutupan Majlis Tilawah al-Quran (Newspaper), 1995, Utusan Malaysia-10 January 1995.
Retrieved on 18 November 2007.
Rabiner, L.R. & Juang, B.H., 1993, ‘Fundamental of Speech Recognition’, Prentice Hall,
New Jersey, USA.
Rabiner, L.R., 1989, ‘A Tutorial on Hidden Markov Model and Selected Applications in
Speech Recognition’, Proceeding of the IEEE, Volume 7, No.2, February 1989.
Ramzi A.H., Omar E.A., 2007. ‘CASRA+: A Colloquial Arabic Speech Recognition
Application". American Journal of Applied Sciences 4(1):23-32, 2007 Science
Publication.
Sari, T., Souici, L. and Sellami, M., 2002, ‘Off-Line Handwritten Arabic Character
Segmentation Algorithm: ACSA’, Proc. Int’l Workshop Frontiers in Handwriting
Recognition, pp. 452-457, 2002.
Shen, J., Hung, J. & Lee, L., 1998, ‘Robust Entropy-based Endpoint Detection for Speech
Recognition in Noisy Environments’, 5th International conference ICSLP ’98, Sydney,
Australia, 1998.
Tabbal, H., El-Falou, W. & Monla, B., 2006, 'Analysis and Implementation of a “Quranic”
verses delimitation system in audio files using speech recognition techniques',In:
Proceeding of the IEEE Conference of 2nd Information and Communication
Technologies, 2006. ICTTA ’06.Volume 2, pp. 2979 – 2984.
Thomas, F.Q., 2002, ‘Discrete Time Speech Signal Processing’, Prentice Hall, New Jersey,
USA.
Ursin, M., 2002, ‘Triphone Clustering in Finnish Continuous Speech Recognition’, Master
Thesis, Department of Computer Science, Helsinki University of Technology, Finland.
Viterbi, A.J.,1967, ‘Error bounds for convolutional codes and an asymptotically optimum
decoding algorithm,’ IEEE Trans. Information Theory, vol. IT-13, pp. 260-269, Apr.
1967.
Youssef, A. & Emam, O., 2004, ‘An Arabic TTS based on the IBM Trainable Speech
Sythesizer’, Department of Electronics & Communication Engineering, Cairo
University, Giza, Egypt.
Wai, C.C., 2003, Speech Coding Algorithm foundations and evolution of standardized
Coders, John Wiley & Sons Inc.,NJ, USA.
APPENDIX A
[Waveform (Amplitude vs. Time [sec]) and spectrogram (Frequency [kHz] vs. Time [sec]) plots for each recorded utterance, including the ‘fatihah1’, ‘fatihah3’, ‘fatihah5’ and ‘fatihah7’ utterances.]

APPENDIX B
APPENDIX B
Journal
Zaidi Razak, Noor Jamaliah Ibrahim, Mohd Yamani Idna Idris, Emran Mohd Tamil, Mohd Yakub @ Zulkifli Mohd Yusoff, and Noor Naemah Abdul Rahman, 2008, "Quranic Verse Recitation Recognition Module for Support in j-QAF Learning: A Review", IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 8, August 2008 (in press), pp. 207-216. Journal ISSN: 1738-7906.
Proceeding
1. Noor Jamaliah Ibrahim, Zaidi Razak, Mohd Yakub @ Zulkifli Mohd Yusoff, Mohd
Yamani Idna Idris, Emran Mohd Tamil, "Quranic verse Recitation feature
extraction using Mel-Frequency Cepstral Coefficients (MFCC)", In Proceedings of
the 4th IEEE International Colloquium on Signal Processing and its Application
(CSPA) 2008, 7-9 March 2008, Kuala Lumpur, MALAYSIA.
2. Noor Jamaliah Ibrahim, Mohd.Yakub@Zulkifli Mohd Yusoff & Zaidi Razak, 2008
"Quranic verse Recitation Recognition Module for Educational Programme",
International Seminar on Research in Islamic Studies 2008 @ ISRIS '08, 17-18
December 2008, Kuala Lumpur, MALAYSIA.
Awards
Gold Medal - Mohd Yakub @ Zulkifli Bin Haji Mohd Yusoff, Zaidi Razak, Noor Jamaliah
Binti Ibrahim, Mohd Yamani Idna Idris, Emran Mohd Tamil & Noorzaily Mohamed Noor,
“Effective Learning of Quranic Verse Recitation Using Automated Tajweed Checking
Rules Educational Tools”, 20th International Invention, Innovation and Technology
Exhibition ITEX 2009, Kuala Lumpur, Malaysia, 15-17 May 2009.