Abstract

This paper describes the frequency analysis of spoken Urdu numbers from ‘sifr’ (zero) to ‘nau’ (nine). Sound samples from multiple speakers were utilized to extract different features. Initial processing of the data, i.e., normalizing and time-slicing, was done using a combination of Simulink and MATLAB. Afterwards, the same tools were used for calculation of Fourier descriptors and correlations. The correlation allowed comparison of the same words spoken by the same and different speakers. The analysis presented in this paper is seen as the first step in creating an Urdu speech recognition system. Such a system can potentially be utilized in the implementation of a voice-driven help setup at call centers of commercial organizations operating in the Pakistan/India region.

Keywords: Spoken Urdu number processing, Fourier descriptors, Correlation, Speaker independent system, Feature extraction, Simulation.

Unconsciously, correlation is used in everyday life. When one looks at a person, car or house, one’s brain tries to match the incoming image with hundreds (or thousands) of images that are already stored in memory [2]. We based our current work on the premise that the same word spoken by different speakers is correlated in the frequency domain.

In the speech recognition research literature, no work has been reported on Urdu speech processing, so we consider our work to be the first attempt in this direction. The analysis has been limited to number recognition. The process involves extraction of some distinct characteristics of individual words by utilizing discrete (Fourier) transforms and their correlations. The system is speaker-independent and is moderately tolerant to background noise.

2. REVIEW OF DISCRETE TRANSFORMATION & ITS MATLAB IMPLEMENTATION
$$\phi(k) = \tan^{-1}\left[ I(k) / R(k) \right] \qquad (2.2)$$

where X(k) is understood to represent X(kω). These equations are therefore analogous to those for the Fourier transform. Note that N real data values (in the time domain) transform to N complex DFT values (in the frequency domain). The DFT values, X(k), are given by:

$$F_D[x(nT)] = \sum_{n=0}^{N-1} x(nT)\, e^{-jk\omega nT}, \quad k = 0, 1, \ldots, N-1 \qquad (2.3)$$

where ω = 2π / NT and F_D denotes the DFT. Thus

$$X[k] = \sum_{n=0}^{N-1} x(nT)\, e^{-jk 2\pi nT / NT}$$

or

$$X[k] = \sum_{n=0}^{N-1} x(nT)\, e^{-jk 2\pi n / N} \qquad (2.4)$$

The Fast Fourier transform (FFT) eliminates most of the repeated complex products in the DFT. In C versions of signal processing algorithms, there are several different routines for real and complex versions of the DFT and FFT. When these routines are coded in the MATLAB language, they are very slow compared with the MATLAB fft routine, which is coded much more efficiently. Furthermore, the MATLAB routines are flexible and may be used to transform real or complex vectors of arbitrary length. They meet the requirements of nearly all signal processing applications; consequently, in this paper, the fft routines are preferred for all discrete transform operations.

MATLAB’s fft routine produces a one-dimensional DFT using the FFT algorithm; that is, when [xK] is a real sequence, fft produces the complex DFT sequence [Xm]. In MATLAB, the length N of the vector [xK] need not be given. Thus both of the following are legal expressions:

    X=fft(x)          (2.6)
    X=fft(x, N)

The first expression in (2.6) produces a DFT with the same number of elements as in [xK], regardless of whether [xK] is real or complex. In the usual case where [xK] is real and of length N, the last N/2 complex elements of the DFT are conjugates of the first N/2 elements in reverse order, in accordance with (2.4). In the unusual case where [xK] is complex, the DFT ...

The results of the following commands with N=4 can be easily verified:

    x=[1 1 0 0];
    X=fft(x)

In this example, the DFT components [Xm]=[2, 1-j, 0, 1+j] are found from (2.4). The second expression in (2.6) specifies the value of N in (2.4), which effectively overrides any previous specification of the length of the vector x. Thus, the following commands produce the same result:

    x=[1 1 0 0 3];
    X=fft(x,4)

The DFT [Xm] has length 4 and is the same as in the previous example.

    x=[1 1];
    X=fft(x, 4)

    [Xm] = [2, 1-j, 0, 1+j]

The result here is the same because, when N is greater than the length of x, X is the DFT of a vector consisting of x extended with zeros on the right, from the length of x to N. (The length of the vector x itself is not increased in the process.) The MATLAB library also includes a two-dimensional fft routine called fft2. The routine computes the two-dimensional FFT of any matrix, whose elements may be, for example, samples (pixel values) of a two-dimensional image.

Usually, recognition occurs when the incoming image bears a strong correlation with an image in memory that “best” fits or is most similar to it. This process also helps one distinguish between, say, a dog and a cat, a rose and a sunflower, or a train and an airplane. A similar approach is used in this investigation to measure the similarity between two signals. This process is known as autocorrelation if the two signals are exactly the same and as cross-correlation if the two signals are different. Since correlation measures the similarity between two signals, it is quite useful in identifying a signal by comparing it with a set of known reference signals. The reference signal that results in the highest value of the correlation with the unknown signal is most likely the identity of the unknown object.

Correlation involves shifting, multiplication and addition (accumulation). The cross-correlation function (CCF) is a measure of the similarities or shared properties between two signals. Applications of the CCF include cross-spectral density, detection and recovery of signals buried in noise (for example, the detection of return signals), pattern matching, and delay measurement. The general formula for the cross-correlation r_xy(n) between two data sequences x(n) and y(n), each containing N data points, might therefore be written as:

$$r_{xy} = \sum_{n=0}^{N-1} x(n)\, y(n) \qquad (2.5)$$

The autocorrelation function (ACF) involves only one signal and provides information about the structure of the signal or its behaviour in the time domain:

$$r_{xx} = \sum_{n=0}^{N-1} x(n)\, x(n) \qquad (2.8)$$

It is a special form of the CCF and is used in similar applications. It is particularly useful in identifying hidden properties.

3. DATA ACQUISITION AND PROCESSING

One of the obvious methods of speech data acquisition is to have a person speak into an audio device such as a microphone or telephone. This act of speaking produces a sound pressure wave that forms an acoustic signal. The microphone or telephone receives the acoustic signal and converts it into an analog signal that can be understood by an electronic system. Finally, in order to store the analog signal on a computer, it must be converted to a digital signal.

The data in this paper was acquired by speaking Urdu numbers into a microphone connected to an MS Windows XP based PC. The data is saved in ‘.wav’ format files. The sound files are processed after passing through a (Simulink) filter, and are saved for further analysis. We recorded the data for fifteen speakers who spoke the same number set, i.e. zero to nine. Each sound sample was truncated to 0.9 minutes.

In general, the digitized speech waveform has a high dynamic range and can suffer from additive noise. So first, a Simulink model was used to extract and analyze the acquired data; see Fig. 1.

[Fig. 1: Simulink model. Block labels: From Wave File (s1_w1.wav, 22050 Hz, 1 channel, 16 bit), Digital Filter Design (FDATool), To Wave Device, Signal To Workspace (yout)]

The Simulink model, as shown in Fig. 2, was developed for performing analyses such as standard deviation, mean, median, autocorrelation, magnitude of FFT, and data matrix correlation. We also tried a few other statistical techniques; however, most of them failed to provide any useful insight into the data characteristics. (These are not discussed further for the sake of brevity.)

We would also like to mention that we had started our experiments by using Simulink, but soon found this GUI-based tool to be somewhat limited because we did not find it easy to create multiple models containing variations among them. This iterative and variable nature of the models eventually led us to MATLAB’s (text-based) .m files. We created these files semi-automatically by using a Perl script, which was developed specifically for this purpose.

Three main data pre-processing steps were required before the data could be used for analysis:

3.1 Pre-Emphasis

By pre-emphasis [5], we imply the application of a normalization technique, which is performed by dividing the speech data vector by its highest magnitude.

3.2 Data Length Adjustment

FFT execution time depends on the exact number of samples (N) in the data sequence [xK]; the execution time is minimal and proportional to N*log2(N) when N is a power of two. Therefore, it is often useful to choose the data length equal to a power of two.

3.3 Endpoint Detection

The goal of endpoint detection is to isolate the word to be detected from the background noise. It is necessary to trim the word utterance to its tightest limits, in order to avoid errors in the modeling of subsequent utterances of the same word. As we can see from the upper part of Fig. 3, a threshold has been applied at both ends of the waveform. The front threshold is normalized to a value such that all the spoken numbers are trimmed to a maximum value. These values were obtained after observing the behavior of the waveform and noise in a particular environment. We can see the difference in frequency characteristics of the words aik (one), teen (three), chaar (four) and paanch (five) in Fig. 3, 4, 5 and 6, respectively.

3.4 Windowing

3.5 Frame Blocking

Since the vocal tract moves mechanically slowly, speech can be assumed to be a random process with slowly varying properties. Hence the speech is divided into overlapping frames of 100 ms. The speech signal is assumed to be stationary over each frame and this property will prove useful in further operations [5], [6], [8].
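The pre-processing steps of Sections 3.1 to 3.5 can be sketched in a few lines of MATLAB. This is only an illustrative sketch on a synthetic tone, not the authors' script: the 22050 Hz sampling rate and the 100 ms frame length come from the text, while the amplitude threshold `thr`, the 50% frame overlap, and all variable names are assumptions made for the example.

```matlab
% Illustrative pre-processing sketch on synthetic data (not the authors' code).
fs  = 22050;                          % sampling rate of the recordings (Hz)
t   = (0:fs-1)' / fs;                 % one second of sample times
x   = zeros(fs, 1);
idx = (t >= 0.3) & (t < 0.7);         % synthetic "word" between 0.3 s and 0.7 s
x(idx) = 0.5 * sin(2*pi*440*t(idx));

% 3.1 Normalization: divide the data vector by its highest magnitude
x = x / max(abs(x));

% 3.3 Endpoint detection: keep samples between first and last threshold crossing
thr    = 0.1;                         % hypothetical amplitude threshold
active = find(abs(x) > thr);
word   = x(active(1):active(end));

% 3.2 Length adjustment: zero-pad to the next power of two
Npad   = 2^nextpow2(length(word));
padded = [word; zeros(Npad - length(word), 1)];

% 3.5 Frame blocking: 100 ms frames, 50% overlap (the overlap is an assumption)
frameLen = round(0.100 * fs);         % 2205 samples per frame
hop      = floor(frameLen / 2);       % step between frame starts
nFrames  = floor((length(padded) - frameLen) / hop) + 1;
frames   = zeros(frameLen, nFrames);
for k = 1:nFrames
    first = (k - 1) * hop + 1;
    frames(:, k) = padded(first:first + frameLen - 1);
end
```

Each column of `frames` then holds one 100 ms segment that can be analyzed as if it were stationary. In practice the threshold would have to be tuned to the recording environment, as noted in Section 3.3.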
3.6 Fourier Transform

The MATLAB algorithm for the two-dimensional FFT routine is as follows [9]:

    fft2(x) = fft(fft(x).').'

Thus the two-dimensional FFT is computed by first computing the FFT of x, that is, the FFT of each column of x, and then computing the FFT of each row of the result. Note that, as the application of the fft2 command produced even-symmetric data, we only show the lower half of the frequency spectrum in our graphs.
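The column-then-row decomposition can be checked numerically. The sketch below uses a small made-up matrix and a short made-up real sequence (both hypothetical, chosen only for illustration). It compares the two-step result with `fft2` and shows why keeping only the lower half of a real signal's magnitude spectrum loses no information.

```matlab
% Check that a column-wise FFT followed by a row-wise FFT matches fft2.
x = [1 2 3 4; 5 6 7 8; 9 10 11 12];    % small hypothetical test matrix
Xcols = fft(x);                         % FFT of each column of x
Xrows = fft(Xcols.').';                 % FFT of each row of the result
X2    = fft2(x);
maxDiff = max(abs(Xrows(:) - X2(:)));   % expected to be at rounding-error level

% For a real 1-D signal the magnitude spectrum is even-symmetric,
% so bins 0 .. N/2 carry all the magnitude information.
s    = [1 1 0 0];
S    = fft(s);
half = abs(S(1:numel(s)/2 + 1));        % magnitudes of bins 0, 1 and 2
```

The non-conjugating transpose `.'` matters here: using `'` instead would conjugate the intermediate result and give a different answer for complex data.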
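[Fig. 2: Simulink analysis model. Block labels: From Wave File inputs s10_w1.wav to s10_w4.wav (22050 Hz, 1 channel, 16 bit), Digital Filter Design (FDATool), Buffer, Magnitude FFT, Autocorrelation LPC, Frame Conversion, Matrix Concatenation, Mean, Display, Vector Scope, To Wave Device, Signal To Workspace (yout1, yout2)]

3.7 Correlation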
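One way to compare utterances in the frequency domain is to correlate their magnitude spectra. The following is a minimal sketch under stated assumptions, not the procedure used to produce the figures: three synthetic sinusoids stand in for recorded words, the names `a`, `b`, `c`, `S` and `R` are hypothetical, and the correlation coefficient from `corrcoef` is used as the similarity measure.

```matlab
% Correlating magnitude spectra of synthetic "utterances" (illustrative only).
N = 1024;
n = (0:N-1)';
a = sin(2*pi*20*n/N);               % stand-in for word A, speaker 1
b = 0.5 * sin(2*pi*20*n/N + pi/3);  % same word, quieter and phase-shifted
c = sin(2*pi*120*n/N);              % stand-in for a different word

S = abs(fft([a b c]));              % magnitude spectrum of each column
S = S(1:N/2 + 1, :);                % keep the lower half of the spectrum
R = corrcoef(S);                    % 3-by-3 matrix of correlation coefficients

% The reference with the highest correlation is taken as the best match for a.
[~, best] = max(R(1, 2:3));         % 1 selects b, 2 selects c
```

Here `R(1,2)` is close to 1 because `a` and `b` share the same spectral peak, while `R(1,3)` is close to 0. Applied to fifteen speakers, the same computation would yield a 15-by-15 matrix of values that can be shown as a mesh or surface plot.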
Fig. 4 The waveform of the correlation of the spoken Urdu number teen (three)

Fig. 6 The waveform of the correlation of the spoken Urdu number paanch (five)

Fig. 7 The waveform of the correlation of the spoken Urdu number sifr (zero) by speaker-1 (panel title: SPEAKER ONE: s1 w0)

[Plot: SPEAKER TWO: s2 w0]

Fig. 9 The waveform of the correlation of the spoken Urdu number sifr (zero) by speaker-3 (panel title: SPEAKER THREE: s3 w0)
Fig. 10 The waveform of the correlation of the spoken Urdu number chaar (four) by speaker-1 (panel title: SPEAKER ONE: s1 w4)

[Plot: SPEAKER ONE: s1 w5]

Fig. 12 The surface plot of the correlation of the spoken Urdu number aik (one) by speaker-15

Fig. 13 The surface plot of the correlation of the spoken Urdu number teen (three) by speaker-15 (panel title: SPEAKERS 15 WORD 3 surface)

Fig. 14 The mesh plot of the correlation of the spoken Urdu number aik (one) by speaker-15 (panel title: SPEAKERS 15 NUMBER 1 mesh)

6. REFERENCES

[1]. S K Hasnain, Nighat Jamil, “Implementation of Digital Signal Processing real time Concepts Using Code Composer Studio 3.1, TI DSK TMS 320C6713 and DSP Simulink Blocksets,” IC-4 conference, Pakistan Navy Engineering College, Karachi, Nov. 2007.

[2]. S K Hasnain, Pervez Akhter, “Digital Signal Processing, Theory and Worked Examples,” January 2007.

[6]. J Koolwaaij, “Speech Processing,” //www.google.com/ search (current 5 May 2004).

[7]. M A Al-Alaoui, R Mouci, M M Mansour, R Ferzli, “A Cloning Approach to Classifier Training,” IEEE Trans. Systems, Man and Cybernetics – Part A: Systems and Humans, vol. 32, no. 6, pp. 746-752, 2002.

[8]. “TMS320C6713 DSK User’s Guide,” Texas Instruments Inc., 2005.

[10]. Samuel D Stearns, Ruth A David, “Signal Processing Algorithms in MATLAB,” Prentice Hall, 1996.