
2011 International Conference on Intelligent Computation and Bio-Medical Instrumentation

Speech Signal Feature Extraction Based on Wavelet Transform

Xiaolan Zhao

Zuguo Wu, Jiren Xu, Keren Wang, Jihai Niu

School of Mechanical and Automotive Engineering
Hefei University of Technology
Hefei, China

Department of Information
Electric Engineering Institute of Hefei

Abstract-Based on an analysis of the voice pronunciation mechanism and the performance differences of normal voice in the frequency domain, the wavelet transform is used to decompose the signal and emphasize the characteristics of voice, and with these two characteristic parameters we recognize 242 normal voices using a Gaussian mixture model (GMM). Wavelet de-noising and the entropy coefficient of decomposition (ECD) are put forward as the characteristic vector sets for recognition, based on multi-scale analysis. After wavelet packet decomposition of the target voice signal, we take the energy of each frequency band as a feature vector. Experiments show that the wavelet transform can improve the frequency characteristics of the signal, compress the dimension of the characteristic space, and achieve a very good classification effect on speech signals.

Keywords-speech signal; wavelet transform; feature extraction; Gaussian mixture model (GMM)

I. INTRODUCTION

The various physical structure characteristics of the vocal cords determine their vibration and occlusion functions, determine the acoustic properties of the laryngeal sound source, and thus produce different sounds. At present, the laryngeal function inspection method used relatively commonly in China relies on computer technology: the Dr. Speech software is used to analyse several acoustic parameters of individual and other voices, and at the same time various acoustic parameters such as fundamental frequency, frequency perturbation, amplitude perturbation, and normalized noise energy are detected in combination with the acoustic figure. There are, however, certain limitations in the effective detection of voice, so the method of computer voice recognition has vital significance.

Speech is a nonlinear and non-stationary signal. The traditional method of obtaining its characteristics is the windowed Fourier transform, whose biggest weakness is that it cannot improve the resolution in time and in frequency at the same time. The wavelet transform can overcome this shortcoming very well: it can adjust the time-frequency window flexibly and improve the resolution of time and frequency simultaneously. In the traditional method of computer voice recognition, MFCC is widely used because it makes full use of the special perception characteristics of the human ear; but some researchers believe that the human ear performs something like a wavelet transform in original voice recognition. Combining this with the differences of voice in different frequency ranges, researchers have put forward wavelet de-noising based on multi-scale analysis and the entropy coefficient of decomposition (ECD), and demonstrated by experiments the superiority of ECD in voice recognition compared with traditional features. In recent years, the HMM has been widely used in speech recognition, neural networks have also been used for voice assessment, and the continuous Gaussian mixture model (GMM), whose state number is 1, is also widely used. Because many Gaussian density functions can be contained in one state and no state transition probabilities exist, the GMM requires much less calculation than the HMM. We use the GMM to recognize voice, obtain the MFCC feature parameters, which are widely used in speech recognition, and the proposed ECDDMA parameters, and compare their recognition results [1-4].

II. EXPERIMENTAL DATA

The experimental data are from sample cases. When the data are sampled, the indoor environment must be quiet; the sampling frequency is 16 kHz, and the duration is from 1.5 s to 3 s. The tested sample is the Chinese vowel sound 'a', and the voice samples of normal persons are obtained. The normal comparison group has 242 examples, aged from 18 to 40 years old, with an average age of 25. We use the CoolEdit software to segment the voice after collection, and obtain the speech database [5].
978-0-7695-4623-0/11 $26.00 2011 IEEE
DOI 10.1109/ICBMI.2011.80




III. GAUSSIAN MIXTURE MODEL

The Gaussian mixture model is a linear combination of Gaussian probability density functions; as long as there is a sufficient number of mixture components, it can approximate any kind of density function. The probability density function of an M-component Gaussian mixture model is obtained by summing M Gaussian probability density functions with different weights, and is shown as follows [6]:

$P(X \mid \lambda) = \sum_{i=1}^{M} \omega_i\, b_i(X)$

Among them, X is a random vector of D dimensions, $b_i(X)$, i = 1, 2, ..., M are the sub-distributions, and $\omega_i$, i = 1, 2, ..., M are the mixture weights. Each sub-distribution is a joint Gaussian probability distribution of D dimensions, and can be expressed as:


$b_i(X) = N(X; \mu_i, \Sigma_i) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left[-\dfrac{1}{2}(X - \mu_i)^{T} \Sigma_i^{-1} (X - \mu_i)\right]$


Among them, $\mu_i$ represents the mean vector of the density function, and $\Sigma_i$ represents the covariance matrix of the density function; the mixture weights must satisfy:

$\sum_{i=1}^{M} \omega_i = 1$

The parameters of the complete Gaussian mixture model are expressed as:

$\lambda = \{\omega_i, \mu_i, \Sigma_i\}, \quad i = 1, 2, \ldots, M$
The most common parameter estimation method for the GMM is maximum likelihood (ML) estimation. For a training vector sequence of length T, $X = \{X_t\}$, t = 1, 2, ..., T, the likelihood of the GMM can be expressed as:

$P(X \mid \lambda) = \prod_{t=1}^{T} P(X_t \mid \lambda)$
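As a minimal numerical illustration of the mixture density and the sequence likelihood above (our own sketch, assuming one-dimensional features and diagonal covariances; the model values are toy numbers, not fitted to the paper's data):

```python
# Minimal sketch of the GMM density P(x | lambda) as a weighted sum of
# Gaussians, and the sequence likelihood as a product over frames
# (computed in the log domain for numerical stability).
import math

def gaussian_pdf(x, mu, var):
    """N(x; mu, var) for a scalar x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, mus, variances):
    """P(x | lambda) = sum_i w_i * N(x; mu_i, var_i); weights must sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, mus, variances))

def sequence_log_likelihood(xs, weights, mus, variances):
    """log P(X | lambda) = sum_t log P(x_t | lambda) (log of the product)."""
    return sum(math.log(gmm_pdf(x, weights, mus, variances)) for x in xs)

# A toy two-component model (M = 2), not derived from real speech.
weights, mus, variances = [0.4, 0.6], [0.0, 3.0], [1.0, 2.0]
print(sequence_log_likelihood([0.1, 2.5, 3.2], weights, mus, variances))
```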

Because the above formula is a nonlinear function of the parameter $\lambda$, it is difficult to maximize it directly. Therefore, the EM (Expectation-Maximization) algorithm is often used to estimate $\lambda$. For multiple observed sequences, the iterative ML re-estimation formulas are:

$\hat{\omega}_k = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k)}{\sum_{c=1}^{C} T_c}$

$\hat{\mu}_k = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k)\, x_t^c}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k)}$

$\hat{\sigma}_k^2 = \dfrac{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k)\, (x_t^c - \hat{\mu}_k)^2}{\sum_{c=1}^{C} \sum_{t=1}^{T_c} \gamma_t^c(k)}$
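One EM re-estimation pass can be sketched as follows (our own illustration, not the authors' implementation; it assumes one-dimensional features and a single observed sequence, i.e. C = 1, with toy data):

```python
# One EM iteration for a 1-D GMM: compute the posteriors gamma[t][k],
# then re-estimate weights, means, and variances from them.
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_step(xs, weights, mus, variances):
    M, T = len(weights), len(xs)
    # E-step: gamma[t][k] = w_k N(x_t; mu_k, var_k) / sum_m w_m N(x_t; mu_m, var_m)
    gamma = []
    for x in xs:
        likes = [weights[k] * gaussian_pdf(x, mus[k], variances[k]) for k in range(M)]
        total = sum(likes)
        gamma.append([l / total for l in likes])
    # M-step: soft occupancy of each component, then the weighted updates.
    occ = [sum(gamma[t][k] for t in range(T)) for k in range(M)]
    new_w = [occ[k] / T for k in range(M)]
    new_mu = [sum(gamma[t][k] * xs[t] for t in range(T)) / occ[k] for k in range(M)]
    new_var = [sum(gamma[t][k] * (xs[t] - new_mu[k]) ** 2 for t in range(T)) / occ[k]
               for k in range(M)]
    return new_w, new_mu, new_var

# Toy samples drawn near 0 and near 3, matching a two-component start.
xs = [0.2, -0.1, 2.9, 3.3, 0.05, 3.1]
w, mu, var = em_step(xs, [0.5, 0.5], [0.0, 3.0], [1.0, 1.0])
print(w)  # updated weights still sum to 1
```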




IV. FEATURE EXTRACTION BASED ON WAVELET TRANSFORM

A. Multi-resolution Analysis
In 1986 Meyer creatively constructed a smooth function with a certain attenuation, whose binary dilations and translations constitute an orthonormal basis of L2(R), the space of square-integrable real functions, namely the space of energy-limited signals; this is what really made wavelets develop. In 1988 the concept of multi-resolution analysis was put forward on the basis of the orthogonal wavelet. It illustrates the multi-resolution characteristics of the wavelet from the concept of space, unifies the construction methods of all previous orthogonal wavelets, and gives a method of constructing orthogonal wavelets together with a fast algorithm based on the orthogonal wavelet transform, namely the Mallat algorithm. The position of the Mallat algorithm in wavelet analysis is comparable to that of the fast Fourier transform in classic Fourier analysis [7-8].
Multi-resolution analysis can be understood with the help of Figure 1. As Figure 1 shows, multi-resolution analysis decomposes only the low frequency part further, and the high frequency part is not decomposed again. The relationship of the decomposition is S = A3 + D3 + D2 + D1. To decompose further, the low frequency part A3 can be split into the low frequency part A4 and the high frequency part D4, and the downward decomposition can proceed in the same way.
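The paper performs the decomposition with Matlab's wavelet toolbox; purely as an illustration of one level of the split S = A + D, here is a minimal Haar decomposition sketch (our own code, not the authors'):

```python
# One level of a Haar wavelet decomposition: a signal S splits into a
# low-frequency approximation A and a high-frequency detail D, and S is
# exactly recoverable from the two, as in the tree S = A3 + D3 + D2 + D1.
import math

def haar_decompose(signal):
    """Split a signal of even length into approximation and detail coefficients."""
    a = [(signal[2*i] + signal[2*i+1]) / math.sqrt(2) for i in range(len(signal)//2)]
    d = [(signal[2*i] - signal[2*i+1]) / math.sqrt(2) for i in range(len(signal)//2)]
    return a, d

def haar_reconstruct(a, d):
    """Invert haar_decompose: interleave (a+d)/sqrt(2) and (a-d)/sqrt(2)."""
    signal = []
    for ai, di in zip(a, d):
        signal.append((ai + di) / math.sqrt(2))
        signal.append((ai - di) / math.sqrt(2))
    return signal

s = [4.0, 2.0, 5.0, 7.0]
a1, d1 = haar_decompose(s)
# Repeating the split on a1 would give A2, D2, and so on down the tree.
print(haar_reconstruct(a1, d1))  # recovers the original signal
```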

Figure 1. The structure of the three-layer multi-resolution analysis tree

In the re-estimation formulas above, C is the number of observed sequences, $T_c$ is the length of the c-th observed sequence, and $\gamma_t^c(k)$ is the posterior probability of the k-th mixture component of the c-th observed sequence at time t:

$\gamma_t^c(k) = \dfrac{\omega_k\, N(x_t^c; \mu_k, \Sigma_k)}{\sum_{m=1}^{M} \omega_m\, N(x_t^c; \mu_m, \Sigma_m)}$


B. Target Feature Extraction by Extracting the Wavelet Tree from the Wavelet Packet Tree
We adopt the wpdec function of the Matlab wavelet analysis toolbox to perform the wavelet packet decomposition; the calling format of the function is:
T = wpdec(X, N, 'wname')
This performs an N-layer wavelet packet decomposition of the signal X according to the wavelet function wname, and returns the wavelet packet decomposition structure [T, D] (T is the tree structure, D is the data structure). In the target feature extraction procedure we use the Symlets wavelet function, usually expressed as symN (N = 2, ..., 8), and take an 8-level wavelet packet decomposition.
The function for extracting the wavelet tree from the wavelet packet tree is wp2wtree; its calling format is T = wp2wtree(T), which modifies the computation tree structure based on the wavelet packet decomposition tree and returns the wavelet tree T.
The speech target signals are arranged according to category; among them, one group is the mixture of a narrowband noise whose center frequency is 3 kHz and a 3 kHz signal.
Taking the wavelet decomposition of the above type of speech target signal and extracting the wavelet tree, the original signal waveform and the multistage decomposition waveforms are shown in Figure 3 and Figure 4.

From Figure 4 we can find that the decomposition of the wavelet tree at all levels can decompose the target signal into different frequency bands effectively. In the following we use the decomposition coefficients at all levels of the target wavelet tree to extract the signal characteristics of a target. The procedure is as follows:
(1) First take the multi-scale wavelet decomposition of the signal, and extract the signal characteristics of each component from the high frequency bands to the low frequency bands;
(2) Reconstruct the decomposition coefficients of the wavelet tree at all levels, and extract the signal within the scope of each frequency band. Suppose $X_j$ (j = 1, 2, ..., 9) denotes the reconstruction signals of s80, s81, s71, s61, s51, s41, s31, s21, s11 respectively; the original signal S can be expressed as
S = X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9
Assume the lowest frequency component in the signal S is 0 and the highest frequency component is 1.
(3) Sum the total energy $E_j$ (j = 1, 2, ..., 9) of each frequency band signal. Define:

$E_j = \sum_{k=1}^{N} |x_j(k)|^2$

where $x_j(k)$ denotes the discrete amplitude values of the reconstruction signal $X_j$, k = 1, 2, ..., N, and N denotes the number of discrete points of the reconstruction signal.
(4) Construct the characteristics vector. Take the total energy of every frequency band of the signal as the elements, and define the total energy $E = \sum_{j=1}^{9} E_j$. The construction of the characteristics vector F is shown below:

$F = [E_1/E,\; E_2/E,\; E_3/E,\; E_4/E,\; E_5/E,\; E_6/E,\; E_7/E,\; E_8/E,\; E_9/E]$

From the above extraction process of the characteristic vector we can see clearly that the feature vector reflects the characteristic information of the target from the angle of energy distribution, and the dimension of the characteristic vector is 9. The extracted feature vector is shown in Figure 4.
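Steps (3) and (4) above can be sketched as follows (an illustrative Python sketch with toy band signals, not the authors' Matlab code; `band_energy` and `energy_feature_vector` are hypothetical helper names):

```python
# Compute the band-energy feature vector F from nine reconstructed band
# signals X_j: E_j = sum_k |x_j(k)|^2, then normalize by the total energy.
def band_energy(x):
    """E_j: total energy of one band signal."""
    return sum(v * v for v in x)

def energy_feature_vector(bands):
    """Normalize each band energy by the total energy E = sum_j E_j."""
    energies = [band_energy(x) for x in bands]
    total = sum(energies)
    return [e / total for e in energies]

# Nine toy band signals standing in for the reconstructions X_1 ... X_9.
bands = [[float(j)] * 4 for j in range(1, 10)]
F = energy_feature_vector(bands)
print(len(F), round(sum(F), 6))  # a 9-dimensional vector whose entries sum to 1
```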

Figure 2. The various levels of decomposition signal waveform of a target




V. CONCLUSION

This paper researches a feature extraction method for speech signals based on the wavelet transform. The experiments show that this method can display the characteristics of the speech signal in different subspaces at different resolutions, and that the energy feature has better category separability than the original signal.

Figure 3. The character vector of a target signal


REFERENCES
[1] CHENG Hao, L Ming, and TANG Bin, et al., "DSSS Signal Detection".
[2] SUN Zhao-qiang, LU Yao-bing, and LI Bao-zhu, et al., "Research on extraction technology of micro-Doppler of wideband signal and its properties", Systems Engineering and Electronics, 2008, (11): 2040-2044.
[3] LI Nan, QU Changwen, and PING Dianfa, et al., "Emitter recognition arithmetic based on fractal theory", Aerospace Electronic.
[4] XU Zhen-hua, ZHANG Lai-bin, and DUAN Li-xiang, "The Application of Time and Frequency Domain Analysis in Piston Abrasion Fault Diagnosis of Reciprocating Compressor", Compressor.
[5] MAN Lingbin and CHEN Chuan, "Research on Passive Localization of Underwater Target Based on Cepstrum", Mine Warfare & Ship Self-Defence, 2010, (1): 49-52, 56.
[6] WANG Yujie, GUO Li, and WANG Cuiping, "An audio steganalysis method for echo hiding based on statistical features of power cepstrums", Journal of University of Science and Technology of.
[7] LONG Liang-jiang, ZHANG Ge-xiang, and TIAN Bo, et al., "A fast algorithm for extracting main ridge slice of ambiguity function of LFM signal", Journal of Chongqing University of Posts and.
[8] LI Ping, TANG Hong, and LI Zhe, "Analysis of the intrapulse characteristic of phase-coded signal based on phase restoral", Aerospace Electronic Warfare, 2010, (3): 30-33.