
Vowels Recognition Using Mel-Frequency

Cepstral Coefficient

Sathees Kumar (1), Paulraj M P (1), Sazali Bin Yaacob (1), Ahamad Nazri (2)

(1) School of Mechatronic Engineering, Universiti Malaysia Perlis, Perlis, Malaysia.
(2) Centre for Communication Skills & Entrepreneurship, Universiti Malaysia Perlis, Perlis, Malaysia.
Email: satheesjuly4@gmail.com


ABSTRACT- The English language as spoken by
Malaysians varies from place to place and differs
from one ethnic community and its sub-groups to
another. Hence, it is necessary to develop a dedicated
speech-to-text translation system for the English
pronunciation of Malaysian speakers. Speech
translation is a process of both speech recognition
and equivalent phoneme-to-word translation. Speech
recognition is the process of identifying phonemes
from a speech segment. In this paper, the initial step
of speech recognition, identifying the phoneme
features, is proposed. In order to classify the
phoneme features, Mel-frequency cepstral
coefficients (MFCC) are computed. A simple
feed-forward neural network (FFNN) trained by the
back-propagation procedure is proposed for
identifying the phonemic features. The extracted
MFCC coefficients are used as input to a neural
network classifier, which associates each sample with
one of 11 classes.
Keywords: Speech to text translation, Digital signal
processing, Phonemes, Mel-frequency cepstral
coefficients

1. INTRODUCTION

Speech is one of the most natural forms of
communication. Speech translation is a process of
both speech recognition and equivalent phoneme-
to-word translation [10]. Speech recognition is the
process of identifying phonemes from a speech
segment. Translation means interpreting the
phonemes as a possible word combination to
establish a speech model of the input speech
segment. In recent years, there has been
increasing research into the design of automatic
speech-to-text conversion systems by various
research groups.

Speech signals are composed of a sequence of
sounds. The study and classification of speech
sounds is called phonetics. Figure 1.1 shows the
classification of the phonemes of American
English. The four broad classes of sounds are
vowels, diphthongs, semivowels, and consonants.
Each of these classes can be further divided into
sub-classes [4].

Figure 1.1 Phoneme classes: vowels (front, mid, back), diphthongs, semivowels (liquids, glides), and consonants (nasals, stops, fricatives, whisper, affricates), with stops and fricatives split into voiced and unvoiced.

Vowels are more difficult to describe accurately
than consonants. This is largely because there is no
noticeable obstruction in the vocal tract during their
production. The only reliable way of observing
vowel production is X-ray photography, which is
not only expensive but also dangerous.
In this work the vowel sounds are classified into
11 classes [12] depending upon the positions of
the tongue, tongue tension and the front, central and
back positions of the lips, as shown in Table 1.1.

Table 1.1 Vowel Classifications

Tongue position | Tongue tension | Front      | Central   | Back
High            | Tense          | beet /iy/  | -         | boot /uw/
                | Relaxed        | bit /ih/   | -         | book /uh/
Mid             | Tense          | bait /ey/  | -         | boat /ow/
                | Relaxed        | bet /eh/   | but /ah/  | bought /ao/
Low             | Not applicable | bat /ae/   | -         | pot /aa/


For many years, the two most common and
successful approaches to speaker recognition have
been based on modelling the speech by Gaussian
Mixture Models and Hidden Markov Models [6]
[7]. These methods are attractive for their phonetic
discrimination capacity [8]. In this research work
the MFCC [9] extracted from the speaker
phonemes act as discriminative features. The
MFCC technique makes use of two types of filters,
namely, linearly spaced filters and logarithmically
spaced filters. The main advantage of MFCC
feature extraction is that it uses mel-frequency
scaling, which closely resembles the human auditory
system. The FFNN has been the subject of
intensive research because of its learning and
generalization properties and its applicability to
classification, approximation and control problems.

Any FFNN used for classification,
approximation and control problems has hidden
and output neurons activated by standard sigmoidal
functions, which may lead to prolonged training
times and a greater number of oscillations in the
output-error-versus-epoch characteristics.

In this paper, a feed forward neural network
model is developed for classifying the vowel
sounds using the MFCC features.

2. FEATURE EXTRACTION USING MEL-FREQUENCY
CEPSTRUM COEFFICIENTS

In this work, the speech signals for the vowel
sounds shown in Table 1.1 were recorded from 10
individuals at a sampling rate of 16 kHz. This
sampling frequency was chosen to minimize the
effects of aliasing in the analog-to-digital
conversion [13]. The MFCC features are extracted
from the recorded speech signals. The basic block
diagram of MFCC feature extraction is given in
Figure 2.1. The main purpose of the MFCC
processor is to mimic the behaviour of the human
ear. In addition, MFCCs are known to be less
susceptible to speaker and recording variations than
the raw speech waveforms themselves.

Figure 2.1 Block diagram of the MFCC processor: continuous speech → framing (frame) → windowing → FFT (spectrum) → mel-frequency wrapping (mel spectrum) → cepstrum (mel cepstrum)




2.1. FRAME BLOCKING
The continuous speech signal is blocked into
frames of N samples, with adjacent frames being
separated by M (M < N). The first frame consists of
the first N samples. The second frame begins M
samples after the first frame, and overlaps it by N -
M samples and so on. This process continues until
all the speech is accounted for within one or more
frames. Typical values for N and M are N =256
and M =100.
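As a sketch, the frame blocking described above can be implemented in a few lines of NumPy (the function name and the use of NumPy are illustrative, not part of the original work):

```python
import numpy as np

def frame_blocking(signal, N=256, M=100):
    """Split a 1-D signal into overlapping frames of N samples.

    Consecutive frames start M samples apart, so adjacent frames
    overlap by N - M samples, exactly as described in Section 2.1.
    """
    num_frames = 1 + max(0, (len(signal) - N) // M)
    frames = np.empty((num_frames, N))
    for i in range(num_frames):
        frames[i] = signal[i * M : i * M + N]
    return frames

# Example: one second of a 16 kHz signal yields 158 frames of 256 samples
x = np.arange(16000, dtype=float)
frames = frame_blocking(x)
print(frames.shape)
```

With N = 256 and M = 100, frame 1 repeats the last N - M = 156 samples of frame 0, as the overlap definition requires.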

2.2 WINDOWING

The purpose of windowing is to minimize the
signal discontinuities at the beginning and end
of each frame, using the window to taper the signal
to zero at the frame edges and so reduce spectral
distortion. If w(n), 0 <= n <= N-1, is the window and
N is the number of samples in each frame, then the
result of windowing is the signal

y(n) = x(n) w(n),  0 <= n <= N-1

Typically the Hamming window is used, which
has the form [13]

w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1)),  0 <= n <= N-1
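A minimal sketch of Hamming windowing for the N = 256 frames produced above (the variable names and the all-ones dummy frame are illustrative):

```python
import numpy as np

# Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))
N = 256
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# Applying the window tapers a frame toward zero at both ends.
frame = np.ones(N)       # dummy frame; in practice a frame from blocking
windowed = frame * w
print(windowed[0], windowed[N - 1])   # both ends reduced to 0.08
```

The endpoint value 0.54 - 0.46 = 0.08 shows that the Hamming window tapers toward (but not exactly to) zero; this matches NumPy's built-in `np.hamming`.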
2.3 FAST FOURIER TRANSFORM (FFT)

After applying the Hamming window, the FFT is
applied to convert the signal from the time domain
to the frequency domain. The FFT is a fast
algorithm to implement the Discrete Fourier
Transform (DFT), which is defined on the set of N
samples {x_n} as follows:

X_k = sum_{n=0}^{N-1} x_n e^{-j 2*pi*k*n/N},  k = 0, 1, ..., N-1
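The DFT definition above can be implemented directly and checked against NumPy's FFT; the direct form below is O(N^2) and is shown only for illustration:

```python
import numpy as np

def dft(x):
    """Direct DFT: X_k = sum_{n=0}^{N-1} x_n * exp(-2j*pi*k*n/N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)          # column of output indices
    return (x * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

rng = np.random.default_rng(1)
x = rng.standard_normal(256)      # one windowed frame, for example
X = dft(x)
# np.fft.fft computes the same transform in O(N log N)
print(np.allclose(X, np.fft.fft(x)))
```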
2.4 MEL-FREQUENCY WRAPPING [13]

The perception of the frequency content of speech
signals does not follow a linear scale. Thus, for
each tone with an actual frequency f, measured in
Hz, a subjective pitch is measured on a scale called
the mel scale. The mel-frequency scale has linear
frequency spacing below 1000 Hz and logarithmic
spacing above 1000 Hz.
The number of mel spectrum coefficients, K,
is typically chosen as 20.
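One common way to realize mel-frequency wrapping is a bank of K = 20 triangular filters with centres spaced evenly on the mel scale m(f) = 2595 log10(1 + f/700). The paper does not give its exact filter design, so the following is a conventional sketch under that assumption:

```python
import numpy as np

def hz_to_mel(f):
    """Subjective pitch in mels for a frequency f in Hz."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(K=20, n_fft=256, fs=16000):
    """K triangular filters, centres evenly spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), K + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((K, n_fft // 2 + 1))
    for j in range(1, K + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        for i in range(left, centre):        # rising edge of triangle j
            fbank[j - 1, i] = (i - left) / (centre - left)
        for i in range(centre, right):       # falling edge of triangle j
            fbank[j - 1, i] = (right - i) / (right - centre)
    return fbank

# Wrap a 129-bin magnitude spectrum (256-point FFT) into 20 mel values
spectrum = np.abs(np.fft.rfft(np.random.randn(256)))
mel_spectrum = mel_filterbank() @ spectrum
```

Each row of the filterbank sums the spectral energy in one mel-spaced band, producing the K mel spectrum coefficients used in the next step.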

2.5 CEPSTRUM [13]

The log mel spectrum is converted back into the
time domain. The result is called the mel-frequency
cepstral coefficients (MFCC). The cepstral
representation of the speech spectrum provides a
good representation of the local spectral properties
of the signal for the given frame analysis. Because
the mel spectrum coefficients are real numbers, we
can convert them to the time domain using the
Discrete Cosine Transform (DCT). Thus the MFCC
can be derived as

c_n = sum_{k=1}^{K} (log S_k) cos[ n (k - 1/2) pi / K ]

where S_k, k = 1, ..., K, are the mel spectrum
coefficients. By applying the procedure described
above, a set of mel-frequency cepstral coefficients
is computed for each overlapping speech frame of
30 ms. For each speech signal the DCT coefficients
are extracted and used as the feature set for the
neural network.
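The DCT step can be sketched as a direct implementation of the cepstrum formula above; keeping the first 13 coefficients is a common choice, assumed here for illustration:

```python
import numpy as np

def mel_cepstrum(S, n_coeffs=13):
    """DCT of the log mel spectrum:
    c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 0.5) * pi / K)."""
    K = len(S)
    k = np.arange(1, K + 1)
    n = np.arange(n_coeffs).reshape(-1, 1)   # one row per coefficient
    return (np.log(S) * np.cos(n * (k - 0.5) * np.pi / K)).sum(axis=1)

rng = np.random.default_rng(2)
S = rng.uniform(0.1, 1.0, 20)   # a hypothetical 20-filter mel spectrum
c = mel_cepstrum(S)
print(c.shape)
```

Note that c_0 reduces to the sum of the log mel energies (cos 0 = 1), which is why the zeroth coefficient is often treated as an overall energy term.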

3. THREE LAYER FFNN

An FFNN consists of three layers, namely, an
input layer, a hidden layer and an output layer, as
shown in Fig.3.1. A supervised learning method is
employed in the FFNN, in which the calculated
output is compared with the target value and the
error is then propagated back through the layers to
modify the weights. The use of bias neurons
enhances the convergence process. The specified
parameters, such as the learning rate and momentum
factor, control the change in weights. A tolerance
level is fixed as the stopping condition for the
training process.

The hidden and output layers have binary
sigmoidal activation functions, which determine the
activations of the corresponding neurons. The
selection of an activation function for a specific
application is a major criterion in achieving good
performance of the BP algorithm.
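A forward pass through such a three-layer network can be sketched as follows, using the 20-30-11 layer sizes and the [-0.5, +0.5] weight initialization reported in Section 4; the matrix layout and variable names are illustrative, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    """Binary sigmoidal activation f(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

# 20 MFCC inputs -> 30 hidden neurons -> 11 vowel-class outputs.
# Each weight matrix carries an extra column for the bias neuron.
V = rng.uniform(-0.5, 0.5, (30, 21))   # input-to-hidden weights
W = rng.uniform(-0.5, 0.5, (11, 31))   # hidden-to-output weights

def forward(x):
    z = sigmoid(V @ np.append(x, 1.0))   # hidden activations (bias = 1)
    y = sigmoid(W @ np.append(z, 1.0))   # output activations (bias = 1)
    return y

y = forward(rng.standard_normal(20))
print(y.shape)
```

The predicted vowel class would be the index of the largest of the 11 outputs; back-propagation then adjusts V and W from the output error.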

Fig.3.1 Three - layer feed forward neural network

4. RESULTS AND DISCUSSION

In the experimental study, voice samples
belonging to the 11 vowel classes were obtained
from 10 individuals. MFCC features are extracted
from the recorded sound waves. These coefficients
are then used as input patterns to the neural
network. In this work a three-layer FFNN with 20
input neurons, 30 hidden neurons and 11 output
neurons is considered. The hidden and output
neurons are activated by the sigmoidal activation
function

f(x) = 1 / (1 + e^-x)

where x is the net input to the neuron.

The initial weights are randomized between
-0.5 and +0.5 and normalized. The FFNN is trained
with the BP algorithm. The network is trained with
60%, 65%, 70%, 75%, and 80% of the total
samples and tested with the remaining samples. The
resulting mean square error (MSE) versus epoch
graph is shown in Fig.4.1.



Fig.4.1 Mean square error vs. epoch for
classification of vowel sounds

The epochs, network training parameters and the
mean classification rates are shown in Table 5.1.

TABLE 5.1 NEURAL NETWORK TRAINING RESULTS

Activation function       1/(1 + e^-x)
Testing tolerance         0.1
Training tolerance        0.03
Number of input neurons   20
Number of hidden neurons  30
Number of output neurons  11
Number of epochs          600

Percentage of training | Mean  | Mean classification | Training
samples (of 440)       | epoch | rate (%)            | time (sec)
60%                    | 60    | 86.30               | 68
65%                    | 31    | 85.78               | 32
70%                    | 39    | 87.14               | 41
75%                    | 22    | 88.75               | 25
80%                    | 48    | 90.41               | 58




5. CONCLUSION
In this paper, voice samples were recorded
from 10 individuals and classified into 11 classes
depending upon the positions of the tongue, tongue
tension and the front, central and back positions of
the lips. MFCC features were extracted from the
recordings, and a simple FFNN-based classifier was
developed, achieving mean classification accuracies
between 85.78% and 90.41%.

REFERENCES

[1] B.-H. Juang and S. Furui, "Automatic recognition and
understanding of spoken language - a first step towards natural
human-machine communication", Proc. IEEE, vol. 88, no. 8,
pp. 1142-1165, 2000.
[2] Tomi Kinnunen, "Spectral Features for Automatic Text-
Independent Speaker Recognition", University of Joensuu,
Department of Computer Science, Joensuu, Finland,
December 21, 2003.
ftp://cs.joensuu.fi/pub/PhLic/2004_PhLic_Kinnunen_Tomi.pdf
[3] L.R. Rabiner and B.H. Juang, Fundamentals of Speech
Recognition, Prentice-Hall, Englewood Cliffs, N.J., 1993.
[4] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech
Signals, Prentice-Hall, Englewood Cliffs, N.J., 1978.
[5] José Ramón Calvo de Lara, "A Method of Automatic
Speaker Recognition Using Cepstral Features and Vectorial
Quantization", Advanced Technologies Application Center,
CENATAV, Cuba.
[6] Joseph P. Campbell, Jr., "Speaker Recognition: A Tutorial",
Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462,
September 1997.
[6] D. E. Sturim, D. A. Reynolds, R. B. Dunn, and T. F. Quatieri,
"Speaker Verification Using Text-Constrained Gaussian Mixture
Models", Proc. IEEE ICASSP, vol. 1, pp. 677-680, May 2002.
[7] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, "Speaker
Verification Using Adapted Gaussian Mixture Models", Digital
Signal Processing, vol. 10, pp. 181-202, 2000.
[8] B. Jacob, "Automatic speech recognition", doctoral thesis,
Paul Sabatier University, Toulouse, September 2003.
[9] H. Matsumoto, "Evaluation of mel-LPC cepstrum in a large
vocabulary continuous speech recognition", Proc. IEEE ICASSP,
2001.
[10] Bertil Lyberg, "Method and arrangement for speech to text
conversion", May 1998.
[11] comp.speech Frequently Asked Questions WWW site,
http://svr-www.eng.cam.ac.uk/comp.speech/
[12] John Laver, Principles of Phonetics, Cambridge University
Press, Great Britain, 1994.
[13] "An automatic speaker recognition system",
http://www.ifp.uiuc.edu/~minhdo/teaching/speaker_recognition/
speaker_recognition.html
