
Digital Signal Processing 20 (2010) 763–768

Speech recognition with artificial neural networks
Gülin Dede, Murat Hüsnü Sazlı

Ankara University, Faculty of Engineering, Electronics Engineering Department, Tandoğan 06100, Ankara, Turkey
Article history:
Available online 23 October 2009

Keywords:
Neural networks
Speech recognition
Human-computer interaction

Abstract

In this paper, artificial neural networks were used to accomplish isolated speech recognition. The topic was investigated in two steps, consisting of a pre-processing part based on Digital Signal Processing (DSP) techniques and a post-processing part based on Artificial Neural Networks (ANN). These two parts are briefly explained, and speech recognizers using different ANN architectures were implemented in Matlab. Three different neural network models were designed: multilayer backpropagation, Elman, and probabilistic neural networks. Performance comparisons with similar studies in the related literature indicate that the proposed ANN structures yield satisfactory results.

© 2009 Elsevier Inc. All rights reserved.
1. Introduction
The speech recognition problem is a sub-branch of pattern recognition [1]. Some popular techniques for tackling this problem are:
artificial neural networks,
dynamic time warping, and
hidden Markov modeling [2].
ANN have been defined in numerous ways by several scientists [1,3,4], but the basic fact they all agree on is that neural networks are made of several processing units called neurons. These processing units are trained using input-output data sets presented to the network. After the training process, the network produces appropriate outputs when tested with similar data sets; in other words, it recognizes the introduced patterns.
In this study, neural networks were preferred not only for their ease of application but also because they yield comparable, and even better, results than the other methods listed above. A recent study on isolated Malay digit recognition [5] reports recognition rates of 80.5% and 90.7% for dynamic time warping and hidden Markov modeling, respectively. Meanwhile, recognition rates obtained by neural networks for similar applications, as in this study, are often above these figures. For this reason, ANN appears to be a convenient classifier for the speech recognition problem.
2. Methodology
In this study, a system that recognizes isolated Turkish digits from zero to nine is implemented.
The proposed method consists of feature extraction from the speech signals using DSP techniques, followed by classification of these features with an ANN.
Detailed explanations of these parts are presented in the following sections.
* Corresponding author.
E-mail addresses: gdede@gmail.com (G. Dede), sazli@eng.ankara.edu.tr (M.H. Sazlı).
1051-2004/$ - see front matter © 2009 Elsevier Inc. All rights reserved.
doi:10.1016/j.dsp.2009.10.004
Fig. 3.1. MFC block diagram.
3. Pre-processing
In any speech recognition application, the speech signals should first be properly characterized. In other words, the information that specifies the signal and belongs only to the specific word to be recognized is important, while the rest can be eliminated. Feature vectors are then obtained from this relevant information. The steps from the recording of the speech signals to feature extraction are therefore called pre-processing.
The numbers from zero to nine were uttered in Turkish by a female speaker and recorded with the GoldWave program. The recording parameters were chosen as:
sampling frequency: 11.025 kHz,
resolution: 16 bits per sample.
In the literature, there are numerous works in which fixed recording periods of 0.8 s are used (e.g. [6,7]). Taking a different approach, in this study words are recorded over periods proportional to the word length, so excessive processing is avoided. Additionally, more effective data acquisition is obtained by detecting the beginning and end points of each utterance based on the square-root summation of the signal energies [8].
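The paper does not give implementation details for the endpoint detector (the original work was done in Matlab). As an illustration only, a minimal Python/NumPy sketch of energy-based endpoint detection might look like the following; the frame length and threshold ratio are assumed values, not taken from the paper:

```python
import numpy as np

def detect_endpoints(signal, frame_len=256, threshold_ratio=0.1):
    """Energy-based endpoint detection: return (start, end) sample indices
    of the region whose frame energy exceeds a fraction of the peak energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Short-time energy per frame (square root of the summed squares)
    energy = np.sqrt(np.sum(frames.astype(float) ** 2, axis=1))
    threshold = threshold_ratio * energy.max()
    active = np.where(energy > threshold)[0]
    start, end = active[0] * frame_len, (active[-1] + 1) * frame_len
    return start, end

# Synthetic test signal: silence, a tone burst, then silence
sig = np.concatenate([np.zeros(1024),
                      np.sin(np.linspace(0, 100, 2048)),
                      np.zeros(1024)])
start, end = detect_endpoints(sig)
```

Real recordings would call for a noise-floor estimate rather than a fixed fraction of the peak energy; [8] describes the full algorithm.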
There are several methods for feature extraction [9]. A prominent one of those methods, and the one used in this project, is the MFC (Mel-Frequency Cepstrum) algorithm. The block diagram representation of this algorithm is shown in Fig. 3.1.
The speech signal is divided into overlapping frames, and these frames are passed through a Hamming window. The defining function of the Hamming window is given in Eq. (1):

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)),  0 ≤ n ≤ N − 1  (1)
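The framing and windowing step can be illustrated with a short Python/NumPy sketch (the original implementation was in Matlab; the frame length and hop size here are assumed values, not taken from the paper):

```python
import numpy as np

def frame_and_window(signal, frame_len=256, hop=128):
    """Split a signal into overlapping frames and apply the Hamming window
    w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) of Eq. (1) to each frame."""
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([signal[s:s + frame_len] * window for s in starts])

frames = frame_and_window(np.ones(1024))
```

With a hop of half the frame length, consecutive frames overlap by 50%, which smooths the transition between frames in the later spectral analysis.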
After that, a 512-point FFT (Fast Fourier Transform) is applied and the amplitude spectrum of the framed signal is calculated. Let x_k denote the signal samples and N the number of samples; the transform X_n can be calculated as in Eq. (2):

X_n = Σ_{k=0}^{N−1} x_k e^(−2πikn/N),  n = 0, 1, 2, . . . , N − 1  (2)
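In NumPy, the 512-point amplitude spectrum of a windowed frame is obtained directly (the 50-cycle test tone below is a synthetic example, not data from the paper; `np.fft.fft` zero-pads the 256-sample frame up to 512 points):

```python
import numpy as np

# One synthetic "frame": 50 cycles of a sinusoid over 256 samples
frame = np.sin(2 * np.pi * 50 * np.arange(256) / 256.0)

# 512-point FFT as in Eq. (2); the frame is zero-padded to 512 samples,
# then the amplitude spectrum is the magnitude of the complex result
spectrum = np.abs(np.fft.fft(frame, n=512))
```

Because the tone completes 50 cycles in 256 samples, its spectral peak falls at bin 50 · 512/256 = 100 of the 512-point transform.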
Then, conversion from the frequency scale to the mel scale is performed by Eq. (3). This results in the mel-frequency spectrum of the signal.

f_mel = 2595 log10(1 + f_linear/700)  (3)
According to Eq. (3), the mel scale is approximately linear for frequencies below 1000 Hz and logarithmic for frequencies above 1000 Hz. Thus, a filter much more similar to the human ear is obtained [10]. In the last step, the cepstrum calculation transforms the logarithm of the mel-frequency spectrum back, and the MFC coefficients of the signal under study are obtained. In this study, 16 MFC coefficients are used.
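The paper does not state how the final cepstrum step was computed; a common choice is a type-II DCT of the log mel spectrum, keeping the first coefficients. A sketch under that assumption follows (the 26-channel log mel spectrum here is synthetic, and 26 is a hypothetical filter-bank size; only the count of 16 kept coefficients comes from the paper):

```python
import numpy as np

def mel_cepstrum(log_mel_spectrum, n_coeffs=16):
    """Final MFC step sketched as a type-II DCT of the log mel spectrum,
    keeping the first n_coeffs cepstral coefficients."""
    M = len(log_mel_spectrum)
    k = np.arange(M)
    # DCT-II basis: cos(pi * (k + 0.5) * i / M) for output index i
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), k + 0.5) / M)
    return basis @ log_mel_spectrum

# Synthetic 26-channel log mel spectrum for demonstration
coeffs = mel_cepstrum(np.log(np.arange(1, 27, dtype=float)))
```

The zeroth coefficient sums the log energies (all cosines equal 1), so it mainly reflects overall loudness; the higher coefficients describe spectral shape.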
4. Post-processing
Post-processing consists of the operations in which the feature vectors are modeled and the words to be recognized are classified according to these models. In this study, ANN, which have shown profound success in recognition problems, are used as the classifier.
In the literature, several works that use various types of ANN have been presented [2,6,7,11]. However, in each of these works and many others, only one network model is investigated. Here, in contrast, three ANN models are scrutinized: the Multilayer Perceptron (MLP), the Elman network, and the probabilistic neural network. Therefore, not only the overall system performance but also the relative network performances are evaluated.
4.1. Network topologies
4.1.1. Multilayer Perceptron (MLP)
The MLP is trained so as to minimize the difference between the expected and actual outputs of the system. The MLP topology designed for this application has the following parameters:
Table 4.1
Recognition results for the MLP (16 test utterances per digit).

Digit   Turkish writing   Correct   % Rec. rate
0       Sıfır             16        100
1       Bir               15        93.75
2       İki               16        100
3       Üç                16        100
4       Dört              15        93.75
5       Beş               16        100
6       Altı              16        100
7       Yedi              16        100
8       Sekiz             16        100
9       Dokuz             16        100
Total                               98.75
hidden layer 1: 20 neurons,
hidden layer 2: 20 neurons,
hidden layer 3: 15 neurons.
Hyperbolic tangent activation functions are used in all hidden-layer neurons; linear activation functions are used in the output-layer neurons.
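The forward pass of this topology can be sketched in Python/NumPy (the original was implemented in Matlab). The 16 inputs match the 16 MFC coefficients; the 10 outputs, one per digit, are an assumption, and the weights below are random and untrained:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    """Randomly initialized weights and biases for one fully connected layer."""
    return rng.normal(0, 0.1, (n_out, n_in)), np.zeros(n_out)

# 16 MFC inputs; hidden layers of 20, 20 and 15 tanh neurons as in the
# paper; 10 linear outputs (one per digit) as an assumed output coding
sizes = [16, 20, 20, 15, 10]
layers = [layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x):
    for W, b in layers[:-1]:
        x = np.tanh(W @ x + b)   # hyperbolic tangent hidden layers
    W, b = layers[-1]
    return W @ x + b             # linear output layer

out = mlp_forward(rng.normal(size=16))
```

Training by backpropagation would adjust these weights to minimize the output error, as described above.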
4.1.2. Elman network
The Elman network, in addition to being a type of recurrent neural network, is basically a two layer back propagated
network. Distinct from other back propagated networks, it has a feedback loop from the output of the rst hidden layer to
that layers input. The Elman network topology designed for this application has the below parameters:
hidden layer 1: 40 neurons,
hidden layer 2: 30 neurons.
Hyperbolic tangent and linear activation functions are used in the first and second hidden layers, respectively. In the output layer, a logarithmic sigmoid activation function is used.
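The feedback loop described above can be sketched as follows (a Python/NumPy illustration with random, untrained weights; the 16-dimensional input and 10-dimensional output are assumptions, and the paper's implementation was in Matlab):

```python
import numpy as np

rng = np.random.default_rng(1)

# Elman sketch: the first hidden layer (40 tanh neurons) receives its own
# previous output through context units; a second hidden layer (30 linear
# neurons) and a log-sigmoid output layer follow, as in the topology above.
n_in, n_h1, n_h2, n_out = 16, 40, 30, 10
W1 = rng.normal(0, 0.1, (n_h1, n_in))
Wc = rng.normal(0, 0.1, (n_h1, n_h1))   # feedback (context) weights
W2 = rng.normal(0, 0.1, (n_h2, n_h1))
W3 = rng.normal(0, 0.1, (n_out, n_h2))

def elman_forward(sequence):
    context = np.zeros(n_h1)
    for x in sequence:                       # one feature vector per step
        context = np.tanh(W1 @ x + Wc @ context)
    h2 = W2 @ context                        # linear second hidden layer
    return 1.0 / (1.0 + np.exp(-(W3 @ h2)))  # log-sigmoid output

out = elman_forward(rng.normal(size=(5, 16)))
```

The context units give the network a short-term memory of earlier frames, which is what distinguishes it from a plain feedforward MLP.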
4.1.3. Probabilistic Neural Network (PNN)
The PNN is a network topology that makes use of a probability distribution function for the calculation of the network connection weights.
In the first hidden layer, the distances from the input data to the training data are calculated; in the second hidden layer, these distances are summed per class, producing the resultant output vector. Thus, model classes are obtained. In the output layer, the output of the network is determined as the most probable model class.
The design process for the PNN is somewhat different from the other two network topologies in terms of training, because in a PNN the weights for the input-output pairs fed to the network are scaled by a distribution constant.
The PNN topology designed for this application has the below parameters:
distribution constant: 0.1,
hidden layer 1: 310 neurons,
hidden layer 2: 10 neurons.
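The three PNN layers described above can be sketched directly (a Python/NumPy illustration; the Gaussian kernel is a common choice but an assumption here, the toy 2-D data is invented, and only the spread value 0.1 comes from the paper):

```python
import numpy as np

def pnn_classify(x, train_x, train_y, spread=0.1):
    """PNN sketch: a Gaussian kernel around each training sample (pattern
    layer), class-wise summation (summation layer), and an argmax decision
    (output layer). The spread is the distribution constant of the paper."""
    d2 = np.sum((train_x - x) ** 2, axis=1)          # squared distances
    activations = np.exp(-d2 / (2.0 * spread ** 2))  # pattern layer
    classes = np.unique(train_y)
    scores = [activations[train_y == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

# Tiny two-class toy example with hypothetical 2-D features
train_x = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
train_y = np.array([0, 0, 1, 1])
pred = pnn_classify(np.array([0.05, 0.02]), train_x, train_y)
```

Since the training samples themselves define the pattern layer, a PNN needs one pattern neuron per training sample, which explains the 310 neurons in the first hidden layer above.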
4.2. Training and test
After the three networks were designed, the training and test steps were carried out.
In pre-processing, a database of 200 utterances was created; 20% of these utterances are used for training and 80% for testing all three network topologies. The recognition rates for each network are shown in Tables 4.1 through 4.3.
According to the recognition rates given in the tables, the most frequently misrecognized Turkish digits are 1 and 4. Both of these words consist of a single syllable, and modeling such one-syllable words is more difficult than modeling words of two or more syllables. This result arises from the fact that they are short signals in the spectrogram, so there are fewer samples to model.
The speech recognition system implemented here is based on closed-set recognition, meaning that the words used in the test set are all chosen from the closed set of words the network was trained on. For further investigation of system performance, it was also tested for open-set recognition. In open-set recognition, the system is tested not only with words from the training set but with words outside the training set as well. Consequently, this system was also tested with some out-of-vocabulary words. The words of this additional database are selected such that they have the most similar pronunciation to
Table 4.2
Recognition results for the Elman network (16 test utterances per digit).

Digit   Turkish writing   Correct   % Rec. rate
0       Sıfır             16        100
1       Bir               15        93.75
2       İki               16        100
3       Üç                16        100
4       Dört              16        100
5       Beş               16        100
6       Altı              16        100
7       Yedi              16        100
8       Sekiz             16        100
9       Dokuz             16        100
Total                               99.375
Table 4.3
Recognition results for the PNN (16 test utterances per digit).

Digit   Turkish writing   Correct   % Rec. rate
0       Sıfır             16        100
1       Bir               16        100
2       İki               16        100
3       Üç                16        100
4       Dört              16        100
5       Beş               16        100
6       Altı              16        100
7       Yedi              16        100
8       Sekiz             16        100
9       Dokuz             16        100
Total                               100
Fig. 4.1. Overall recognition rates.
Turkish digits, in order to make the problem harder. They are sır, onbir, kedi, güç, dürt, baş, altmış, yetki, seksen and sekz. When the system is tested with this vocabulary, the words are consistently recognized as undefined and are not mistaken for the digits in any of the three network topologies. In a few tests, digits 3 and 6 were mismatched with the words güç and altmış by the MLP topology, which is a result of both training stochastics and pronunciation problems.
Recognition rates for all networks are shown in Fig. 4.1.
From Fig. 4.1 it is obvious that the PNN topology gives the best results, so additional investigation was carried out.
Firstly, the test material is enlarged by the so-called leave-one-out method: the 20 utterances of each digit are divided into 5 groups, and in each round one group is used for training while the remaining 16 utterances are used for testing, resulting over the 5 rounds (16 × 5 = 80) in a test set of 80 utterances per digit.
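A sketch of this rotation follows (Python illustration; the grouping of 20 utterances into 5 groups of 4, training on one group and testing on the other 16 per round, is my reading of the 16 × 5 = 80 figure in the text):

```python
import numpy as np

def rotated_folds(utterances_per_digit=20, n_groups=5, group_size=4):
    """Rotation over 5 groups of 4 utterances: in each round one group
    trains the network and the remaining 16 utterances are tested,
    giving 16 * 5 = 80 test results per digit over the 5 rounds."""
    idx = np.arange(utterances_per_digit)
    folds = []
    for g in range(n_groups):
        train = idx[g * group_size:(g + 1) * group_size]
        test = np.setdiff1d(idx, train)
        folds.append((train, test))
    return folds

folds = rotated_folds()
total_tests = sum(len(test) for _, test in folds)
```

Each utterance thus appears in four of the five test sets, which is how 20 recordings per digit yield 80 test results per digit.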
The previous tests were then repeated, and the recognition rates shown in Table 4.4 were obtained. From the confusion matrix, it is deduced that when the database is enlarged and changed, different results are acquired; yet the successful recognition accuracy is preserved. The most frequently undefined digit is 1, but this is an expected situation, as one-syllable words are difficult to recognize for non-phoneme-based systems.
In order to find out the effect of database size on the PNN topology, additional tests were done with the accuracy fixed at 98.375%, and the number of samples in the training set versus the number of neurons was plotted in Fig. 4.2.
Tests with 1, 2, 3, 4 and 5 utterances per digit as training data resulted in a linear relationship between the number of samples in the training set and the number of neurons used in the network. The tests were ended at 5 utterances because a clear trend had emerged in the outputs and any further increase would only add unnecessary computational complexity.
Table 4.4
Recognition results for the PNN with the larger database (80 test utterances per digit).

Digit   Turkish writing   Correct   Undefined   % Rec. rate
0       Sıfır             79        1           98.75
1       Bir               74        6           92.50
2       İki               77        3           96.25
3       Üç                80        0           100
4       Dört              80        0           100
5       Beş               80        0           100
6       Altı              80        0           100
7       Yedi              79        1           98.75
8       Sekiz             78        2           97.50
9       Dokuz             80        0           100
Total                                           98.375
Fig. 4.2. Number of samples vs. neurons.
5. Discussion
Briefly, the speech recognition system designed in this project has provided satisfactory results. With this system, a small vocabulary consisting of the Turkish digits has been recognized with high accuracy, and some out-of-vocabulary words, despite having pronunciations very similar to the digits, have been correctly determined as undefined. Recognition accuracy ranged from a minimum of 98.125% to a maximum of 100%, which is high enough to show that neural networks are successful classifiers for speech recognition tasks. In particular, the PNN structure, which achieved the highest recognition rates, appears to be a more successful classifier for such tasks than probably the most popular topology, the MLP. In this respect, the number of neurons in the PNN topology was further investigated against the training set size, and a linear relationship was seen to exist between the number of neurons and the amount of training data. This means that once the network has been taught with a sufficient amount of data, there is no benefit in increasing either the training data or the number of neurons. Besides its good recognition rates, this characteristic of the PNN topology turns out to be another advantage and makes it more stable than the other two ANN types studied here. From this point of view, this study can be extended to an in-depth analysis of the PNN structure itself and its optimization. Additional suggestions for further work might be the investigation of other ANN structures and of new vocabularies, either in Turkish or in any other language.
6. Conclusion
To conclude, the speech recognition task is a milestone for human beings in transferring their abilities to machines, because, despite being ordinary events for people to understand, speech signals are very sophisticated phenomena for computers to manage. For this reason, every step in the area of speech recognition is an important development in man-machine interaction. In previous studies, similar investigations were carried out on Urdu [7] and Arabic [11] digits, for which the authors reported recognition rates of 98% and 99.5%, respectively, and similar results were obtained when recognition was done on a bit stream [12]. Both [7] and [11] focused on the MLP, whereas this research investigates not only the MLP but the Elman network and the PNN as well. Additionally, the recognition rates show comparable and even better results, ranging from a minimum of 98.125% to a maximum of 100% for the different network topologies mentioned above. All results, including those of the reference studies, show that ANN is a proper technique for achieving speech recognition tasks. Finally, since one aim of ANN is the transfer of cognitive tasks peculiar to humans to computers, it is not only a practical but also a popular research tool.
References
[1] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ, 1999.
[2] H. Bourobua, M. Bedda, R. Djemili, Isolated words recognition system based on hybrid approach, Informatica 30 (2006) 373–384.
[3] T. Kohonen, State of the art in neural computing, in: IEEE First International Conference on Neural Networks, vol. 1, 1987, pp. 79–90.
[4] B. Widrow (Ed.), DARPA: Neural Network Study, AFCEA International Press, 1988.
[5] S.A.R. Al-Haddad, S.A. Samad, A. Hussain, K.A. Ishak, Isolated Malay digit recognition using pattern recognition fusion of dynamic time warping and hidden Markov models, Am. J. Appl. Sci. 5 (6) (2008) 714–720.
[6] A. Ahad, A. Fayyaz, T. Mehmood, Speech recognition using multilayer perceptron, in: Proc. of the IEEE Conference ISCON'02, vol. 1, 2002, pp. 103–109.
[7] S.M. Azam, Z.A. Mansoor, M.S. Mughal, S. Mohsin, Urdu spoken digits recognition using classified MFCC and backpropagation neural network, in: Computer Graphics, Imaging and Visualization Conference, 2007.
[8] L. Rabiner, M. Sambur, An algorithm for determining the endpoints of isolated utterances, Bell Syst. Tech. J. 54 (1975) 297–315.
[9] C. Marven, G. Ewers, A Simple Approach to Digital Signal Processing, Wiley-Interscience, New York, 1996.
[10] S.S. Stevens, J. Volkman, E.B. Newman, A scale for the measurement of the psychological magnitude pitch, J. Acoust. Soc. Am. 8 (1937) 185–190.
[11] Y.A. Alotaibi, Investigating spoken Arabic digits in speech recognition setting, Inform. Sci. 173 (2005) 113–129.
[12] D. Acar, H. Karci, H.G. Ilk, M. Demirekler, Wireless speech recognition using fixed point mixed excitation linear prediction (MELP) vocoder, in: Proc. of the Wireless and Optical Communications Conference, WOC-2002, vol. I, Banff, Canada, 2002, pp. 641–644.
Gülin Dede was born in 1981 in Ankara, Turkey. She received the M.Sc. degree in electronics engineering from the Electronics Engineering Department, Ankara University, in 2008. Her areas of interest include signal processing, neural networks and their applications.

Dr. Murat H. Sazlı was born in 1973 in Elazığ, Turkey. He received the B.Sc. and M.Sc. degrees in electronics engineering from the Electronics Engineering Department, Ankara University, with high honors, in 1994 and 1997, respectively. He received the Ph.D. degree in electrical engineering from Syracuse University in 2003. He was the recipient of the Outstanding Teaching Assistant Award from Syracuse University in 2002. He is currently an Assistant Professor and Vice Chairman of the Electronics Engineering Department of Ankara University. His areas of interest include turbo coding and decoding, neural networks and their applications, and wireless communications.