
2011 International Conference on Multimedia and Signal Processing

Adaptive Speech Enhancement Based on Classification of Voiced/Unvoiced Signal

Zhang Yi, Zhang Jun-chang, Ye Zhen
School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
Email: nwpuzhangyi@gmail.com, nwpuzjc@yahoo.com.cn, angelyon@sohu.com
Abstract-At low SNRs, the conventional wavelet enhancement algorithm may lose useful components of the speech, resulting in poor enhancement performance. To solve this problem, this paper presents an improved wavelet enhancement algorithm based on the classification of voiced/unvoiced signal. First, we remove part of the noise with a spectral subtraction algorithm and separate voiced from unvoiced signal according to its short-time energy. Second, the Wavelet Packet Transform (WPT) is applied to unvoiced speech to prevent signal distortion. Then dynamic thresholds are applied at different wavelet analysis scales to avoid over-smoothing the signal waveform. Finally, a new adaptive threshold function is used to make up for the disadvantages of the soft and hard thresholding algorithms. Experiments show that the proposed method can remove much of the noise while keeping the reconstructed speech intelligible.

Keywords-speech enhancement; voiced/unvoiced speech; dynamic threshold; adaptive thresholding

I. INTRODUCTION
Speech enhancement aims at improving the performance of speech communication systems in noisy environments by suppressing the noise and improving the perceptual quality and intelligibility of the speech signal. It is an important research field within speech signal processing, with applications in many areas such as voice communication and automatic speech recognition.

Traditional speech de-noising techniques are predominantly based on Wiener filtering [1], spectral subtraction [2] and subspace filtering [3], which have attracted significant interest because of their simple design and implementation. Although these linear methods can reduce the noise and improve the signal-to-noise ratio (SNR), they are not very effective when signals contain sharp shapes or impulses of short duration. To overcome these limits, a nonlinear approach using the Wavelet Transform (WT) to reduce the noise has been proposed. It is based on the assumption that the signal magnitudes dominate those of the noise in a wavelet representation, so a threshold can be set to shrink the wavelet coefficients of the noise, and an inverse wavelet transform is then applied to the remaining coefficients to reconstruct the original speech.
Donoho [4] introduced the signal de-noising technique using the wavelet transform in 1995. However, it has two limitations: 1) because it uses a universal threshold, it cannot track changes of the SNR and is therefore ineffective for non-stationary noise suppression; 2) when the SNR is low and the spectra of speech and noise overlap, the unvoiced signal is often lost after de-noising because of its similarity to the noise. In short, Donoho's method cannot balance protecting the speech signal against eliminating non-stationary noise, especially in a low-SNR environment. Since then, various wavelet de-noising algorithms have been studied, primarily focusing on adaptive noise estimation and dynamic threshold adjustment to track the changing local SNR. Akhaee, Ameri and Marvasti proposed a hybrid method that applies adaptive filters to the lower scales of a wavelet-transformed speech signal together with conventional methods (thresholding, spectral subtraction and Wiener filtering) on the higher-scale coefficients [5]. Yu Shao and Chip-Hong Chang proposed a versatile speech enhancement system based on perceptual wavelet de-noising, in which a modified wavelet de-noising with soft thresholding replaces adaptive noise cancellation to improve the efficiency and robustness of the system [6]. Tai-Chiu Hsung and Daniel Pak-Kong Lun introduced multitaper spectrum estimators into their work and developed a wavelet de-noising method with an adaptive threshold applied to the multitaper power spectrum estimate [7]. Lan Xu and Hon Keung Kwan presented an adaptive wavelet de-noising system that combines a noise estimator with a threshold adjuster; their system also includes a musical-noise suppression module and a silent-segment smoother to improve the final perceptual quality [8]. However, these algorithms either distort the signal or leave residual noise while preserving the speech components, because they do not focus on the distinction between voiced and unvoiced signal.

To preserve the useful information in a signal and remove as much noise as possible, this paper presents a new adaptive wavelet de-noising algorithm based on the classification of voiced and unvoiced signal. The speech signal is first divided into frames with a certain overlap between neighboring frames. Then, for each frame, different wavelet transforms and dynamic thresholds are applied, depending on whether its short-time energy identifies it as voiced or not. A new thresholding function is also introduced to offset the shortcomings of the conventional ones.

The rest of this paper is organized as follows: Section II introduces the conventional wavelet de-noising algorithm based on the universal threshold. Section III presents the proposed wavelet de-noising system and its advantages over the conventional ones. Section IV gives simulations and compares the performance of different de-noising methods. Finally, Section V concludes the paper.

II. WAVELET DENOISING


Since, in the real world, additive noise accounts for the greater proportion and non-additive noise can often be transformed into it, this paper focuses mainly on the de-noising of additive noise. Suppose that the original signal s(k) of length N is corrupted by noise d(k); the noisy signal x(k) is then given as:

x(k) = s(k) + d(k),  1 ≤ k ≤ N    (1)

According to [5], de-noising of the signal s(k) is performed by first taking the wavelet transform of the noisy data; the threshold T is then defined as:

T = σ·√(2 ln N)    (2)

for the discrete wavelet transform, and, for the wavelet packet transform case, as:

T = σ·√(2 ln(N·log2 N))    (3)

where N is the length of the noisy signal and σ is an estimate of the noise standard deviation, given by:

σ = MAD / 0.6745 = Median(|c|) / 0.6745    (4)

where MAD is the Median Absolute Deviation of the wavelet coefficients at the highest resolution and c is the coefficient sequence obtained from the wavelet transform.

Hard and soft thresholding can be applied to the coefficients obtained after wavelet decomposition. Usually thresholding is applied to the detail coefficients, while the approximation coefficients (the low-pass filter output) are left untouched. In hard thresholding, all coefficients whose absolute value is below T are forced to zero. For a detail coefficient w_{j,k}, hard thresholding is represented as:

ŵ_{j,k} = w_{j,k},  if |w_{j,k}| ≥ T
ŵ_{j,k} = 0,        if |w_{j,k}| < T    (5)

In the case of soft thresholding, the detail coefficients are modified as:

ŵ_{j,k} = sgn(w_{j,k})·(|w_{j,k}| - T),  if |w_{j,k}| ≥ T
ŵ_{j,k} = 0,                             if |w_{j,k}| < T    (6)

Finally, an inverse wavelet transform is applied to the thresholded coefficients to reconstruct the original speech signal.
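For reference, the following is a minimal Python sketch of the conventional scheme just described, using the PyWavelets package: σ is estimated from the finest-scale detail coefficients via (4), the universal threshold (2) is applied to all detail coefficients with hard or soft shrinkage, and the signal is rebuilt by the inverse transform. The function and parameter names are ours, not from the paper.

```python
import numpy as np
import pywt

def universal_wavelet_denoise(x, wavelet="sym8", level=4, mode="soft"):
    """Conventional wavelet de-noising with Donoho's universal threshold."""
    coeffs = pywt.wavedec(x, wavelet, level=level)      # [cA_n, cD_n, ..., cD_1]
    finest_detail = coeffs[-1]
    # Eq. (4): sigma = Median(|c|) / 0.6745, using the finest-scale details.
    sigma = np.median(np.abs(finest_detail)) / 0.6745
    # Eq. (2): universal threshold for the discrete wavelet transform.
    T = sigma * np.sqrt(2.0 * np.log(len(x)))
    # Shrink only the detail coefficients; keep the approximation untouched.
    denoised = [coeffs[0]] + [pywt.threshold(c, T, mode=mode) for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(x)]

# Example: remove white noise from a toy harmonic signal.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(2048) / 8000.0
    clean = np.sin(2 * np.pi * 200 * t)
    noisy = clean + 0.3 * rng.standard_normal(t.size)
    enhanced = universal_wavelet_denoise(noisy, mode="soft")
    print("residual power:", np.mean((enhanced - clean) ** 2))
```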

III. PROPOSED WAVELET DENOISING SYSTEM

A. Classification of Voiced/Unvoiced Signal

There are two types of speech signal: voiced and unvoiced. On the one hand, the voiced signal is quasi-periodic; it contains a large number of low-frequency components and carries the majority of the speech energy. On the other hand, with its high frequency and low energy, the waveform of the unvoiced signal is very similar to that of the noise and tends to be eliminated by wavelet thresholding. Thus, a separation of voiced and unvoiced signal is proposed in this paper to improve the quality of the reconstructed speech.

In this paper, we identify unvoiced signal according to its short-time energy. First, the speech signal is divided into short-time segments, also known as frames, and a wavelet transform is applied to each of them. Accordingly, for each frame we obtain n+1 sub-frequency bands (n is the number of wavelet decomposition layers), and the average energy of each band is calculated. If the energy of the highest sub-frequency band is the largest and the energy ratio between the highest and the lowest sub-frequency bands is less than a certain threshold, the frame is classified as unvoiced (a sketch of this rule follows below).

However, this criterion does not always give a clean decision between voiced and unvoiced frames, because the two types of signal become similar in appearance when noise is present. Therefore, in this paper we first remove part of the noise with a spectral subtraction algorithm before making the judgment.
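The classification rule above can be sketched as follows. The energy-ratio threshold and the frame-slicing helper are illustrative assumptions (the paper does not give numerical values), and the decision would normally be applied after the spectral-subtraction pre-cleaning described next.

```python
import numpy as np
import pywt

def split_frames(x, frame_len=240, overlap=0.5):
    """Slice a signal into overlapping frames (50% overlap by default)."""
    hop = int(frame_len * (1.0 - overlap))
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]

def is_unvoiced(frame, wavelet="sym8", level=4, ratio_thresh=0.2):
    """Label a frame unvoiced from its wavelet sub-band energies.

    The frame is split into level+1 sub-bands; it is labelled unvoiced when
    the highest-frequency band has the largest average energy and the
    low-band/high-band energy ratio stays below ratio_thresh (an
    illustrative value, not taken from the paper).
    """
    coeffs = pywt.wavedec(frame, wavelet, level=level)   # lowest band first
    energies = [np.mean(c ** 2) for c in coeffs]
    low, high = energies[0], energies[-1]
    return high == max(energies) and low / (high + 1e-12) < ratio_thresh

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    voiced_like = np.sin(2 * np.pi * 120 * np.arange(240) / 8000.0)
    fricative_like = np.diff(rng.standard_normal(241))   # high-pass-like noise
    print(is_unvoiced(voiced_like), is_unvoiced(fricative_like))
```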

Suppose that the noise-corrupted speech signal is given by (1). Taking the Fourier transform of both sides, we get:

X(jω)·e^{jφ_x(ω)} = S(jω)·e^{jφ_s(ω)} + D(jω)·e^{jφ_d(ω)}    (7)

where X(jω), S(jω) and D(jω) are the spectral magnitudes of the noise-corrupted speech, the clean speech and the additive background noise, respectively, and φ_x(ω), φ_s(ω) and φ_d(ω) are the corresponding phase spectra. Since phase distortion is not perceived by the human ear, the phases of the background noise and of the clean speech can be replaced by that of the noise-corrupted speech, so (7) can be rewritten as:

X(jω)·e^{jφ_x(ω)} = Ŝ(jω)·e^{jφ_x(ω)} + D̂(jω)·e^{jφ_x(ω)}    (8)

where Ŝ(jω) and D̂(jω) are the estimates of the clean-speech and background-noise magnitudes, respectively. The estimated magnitude spectrum of the clean speech can then be represented as:

Ŝ(jω) = X(jω) - D̂(jω)    (9)

Finally, taking an inverse Fourier transform using the phase spectrum of the noise-corrupted speech, the enhanced speech ŝ(n) is obtained as:

ŝ(n) = IFFT[ Ŝ(jω)·e^{jφ_x(ω)} ]    (10)
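A compact sketch of the magnitude spectral subtraction of (7)-(10): each frame is transformed, the estimated noise magnitude is subtracted, and the frame is rebuilt with the noisy phase. The noise estimate from leading noise-only frames, the Hann window and the overlap-add resynthesis are our assumptions; the paper does not specify them.

```python
import numpy as np

def spectral_subtract(noisy, frame_len=240, noise_frames=6):
    """Magnitude spectral subtraction keeping the noisy phase, cf. Eqs. (7)-(10).

    The noise magnitude spectrum is estimated by averaging the first few
    frames, assumed to contain noise only; this is an illustrative choice,
    the paper does not fix a particular noise estimator.
    """
    hop = frame_len // 2
    window = np.hanning(frame_len)
    # Noise magnitude estimate |D(jw)| from the leading frames.
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noisy[i * hop:i * hop + frame_len] * window))
         for i in range(noise_frames)], axis=0)

    out = np.zeros_like(noisy, dtype=float)
    for start in range(0, len(noisy) - frame_len + 1, hop):
        frame = noisy[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        clean_mag = np.maximum(mag - noise_mag, 0.0)        # Eq. (9), floored at 0
        # Eq. (10): inverse transform with the noisy phase, overlap-add synthesis.
        out[start:start + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase),
                                                     n=frame_len)
    return out
```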

Since the noise magnitude spectrum cannot be estimated accurately, the conventional spectral subtraction algorithm often introduces musical noise into the enhanced speech. In order to reduce this annoying musical noise, we adopt the improved form of spectral subtraction given by Virag [9]:

H(ω) = H[SNR_post(ω)]
     = [1 - α·(D̂(ω)/Y(ω))^γ1]^γ2,   if (D̂(ω)/Y(ω))^γ1 < 1/(α + β)
     = [β·(D̂(ω)/Y(ω))^γ1]^γ2,        otherwise    (11)

where Y(ω) and D̂(ω) denote the magnitude spectra of the noisy speech and the estimated noise.


Here α is an over-subtraction factor (α > 1) that determines the balance between the amount of noise reduction and speech distortion. The spectral flooring factor β (0 < β ≪ 1) re-adds a small amount of background noise in order to mask the residual noise. The exponents γ1 and γ2 determine the sharpness of the transition from H(ω) = 1 (the spectral component is not modified) to H(ω) = 0 (the spectral component is suppressed). The choice of the subtraction parameters α, β, γ1 and γ2 is a central issue in single-channel speech enhancement: at low SNRs it is impossible to minimize speech distortion and residual noise simultaneously. In our case, we are concerned with reducing the noise while keeping the speech distortion acceptable for the ensuing processing.
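The parametric rule (11) can be sketched as a gain applied to each frame's spectrum. The branch condition follows the form given above, and the parameter values (α, β, γ1, γ2) are common illustrative choices rather than the ones used in the paper.

```python
import numpy as np

def parametric_gain(noisy_mag, noise_mag, alpha=4.0, beta=0.01, g1=2.0, g2=0.5):
    """Virag-style parametric spectral-subtraction gain H(w), cf. Eq. (11).

    alpha > 1 controls over-subtraction, 0 < beta << 1 is the spectral floor,
    and g1/g2 shape the transition between H = 1 and H = 0.  The values here
    are illustrative defaults, not the ones used in the paper.
    """
    ratio = (noise_mag / np.maximum(noisy_mag, 1e-12)) ** g1
    subtract = np.clip(1.0 - alpha * ratio, 0.0, None) ** g2
    floor = (beta * ratio) ** g2
    # Use the subtraction branch while it stays above the floor branch,
    # i.e. while the noise-to-signal ratio is small enough.
    return np.where(ratio < 1.0 / (alpha + beta), subtract, floor)

# Applying the gain to one frame's spectrum Y (complex rFFT values):
# enhanced_spec = parametric_gain(np.abs(Y), noise_mag) * Y
```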

B. Combination of Mallat Algorithm and Wavelet Packet Transform (WPT)

Conventional wavelet de-noising is mainly based on multi-resolution analysis and the Mallat algorithm. However, wavelet analysis with the Mallat algorithm only decomposes the low-frequency band at each level, without further splitting the high-frequency band. Such an analysis is not well suited to unvoiced speech, whose spectrum is mainly located in high-frequency regions that overlap with the noise. In comparison, the WPT, an extension of the WT, decomposes both the low- and the high-frequency bands and is therefore better suited to a precise analysis of unvoiced speech. Accordingly, after the classification of voiced and unvoiced signal, we adopt different wavelet analyses for them in this paper: WT for voiced frames and WPT for unvoiced frames.
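A sketch of the per-frame choice between WT and WPT with PyWavelets follows. For brevity it applies a single universal-style threshold per frame with soft shrinkage; the paper instead uses the scale-dependent thresholds of Section III-C and the adaptive rule of Section III-D.

```python
import numpy as np
import pywt

def denoise_frame(frame, unvoiced, wavelet="sym8", level=4):
    """Apply WT to voiced frames and WPT to unvoiced frames, then threshold."""
    N = len(frame)
    if not unvoiced:
        # Voiced frame: Mallat (DWT) analysis, Eq. (2) style threshold.
        coeffs = pywt.wavedec(frame, wavelet, level=level)
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745
        T = sigma * np.sqrt(2.0 * np.log(N))
        coeffs = [coeffs[0]] + [pywt.threshold(c, T, mode="soft") for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[:N]
    # Unvoiced frame: full wavelet packet analysis, Eq. (3) style threshold.
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    leaves = wp.get_level(level, order="freq")
    sigma = np.median(np.abs(leaves[-1].data)) / 0.6745
    T = sigma * np.sqrt(2.0 * np.log(N * np.log2(N)))
    for node in leaves:
        node.data = pywt.threshold(node.data, T, mode="soft")
    return wp.reconstruct(update=False)[:N]
```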

C. Dynamic Wavelet Threshold

The efficiency of a thresholding technique strongly depends on the threshold used. Donoho and Johnstone [4] proposed the universal thresholds given by (2) and (3). In this paper, we instead use dynamic thresholds that differ across the wavelet analysis scales. For the voiced segments of a speech signal, analyzed by WT, the threshold is given by:

T_j = σ_j·√(2 ln N),  j = 1, 2, ..., n    (12)

For the unvoiced segments, analyzed by WPT, the threshold is given by:

T_j = σ_j·√(2 ln(N·log2 N)),  j = 1, 2, ..., n    (13)

where N is the length of the noisy signal and σ_j is the estimated noise standard deviation at scale j, given by:

σ_j = Median(|W_{j,k}|) / 0.6745    (14)

where W_{j,k} denotes the high-frequency (detail) coefficients at scale j.
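A sketch of the per-scale thresholds (12)-(14) follows. For brevity, σ_j is estimated here from the DWT detail band of each scale; for unvoiced frames the same median rule would be applied to the wavelet-packet nodes of each level.

```python
import numpy as np
import pywt

def dynamic_thresholds(frame, unvoiced, wavelet="sym8", level=4):
    """Per-scale thresholds T_j of Eqs. (12)-(14).

    sigma_j is estimated from the detail coefficients of each scale via the
    median rule (14); voiced frames use sqrt(2 ln N), unvoiced frames use
    sqrt(2 ln(N log2 N)) as in (12)/(13).
    """
    N = len(frame)
    factor = np.sqrt(2.0 * np.log(N * np.log2(N))) if unvoiced \
        else np.sqrt(2.0 * np.log(N))
    details = pywt.wavedec(frame, wavelet, level=level)[1:]   # cD_n ... cD_1
    thresholds = []
    for W_j in details:                                       # one scale at a time
        sigma_j = np.median(np.abs(W_j)) / 0.6745             # Eq. (14)
        thresholds.append(sigma_j * factor)                   # Eq. (12)/(13)
    return thresholds
```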

D. Adaptive Thresholding Algorithm

Hard and soft thresholding are given by (5) and (6), respectively. However, each has its own disadvantages. Hard thresholding preserves many features near the edges of the signal, but it may introduce oscillations into the reconstructed signal; the output of soft thresholding is smoother, but the reconstructed signal is distorted because the coefficient values are reduced after thresholding, and the resulting SNR is often low.

To improve the performance of wavelet de-noising, we propose a new thresholding rule:

Ŵ_{j,k} = sgn(W_{j,k})·(|W_{j,k}| - μλ),                 if |W_{j,k}| ≥ T_j + λ
Ŵ_{j,k} = 0,                                              if |W_{j,k}| ≤ T_j
Ŵ_{j,k} = W_{j,k} + 2μλ/(1 + exp(2W_{j,k}/λ)) - μλ,       otherwise    (15)

where λ sets the width of the transition region around the threshold T_j and the factor μ (0 < μ ≤ 1) adjusts the amount of shrinkage. From (15) we can see that when |W_{j,k}| ≥ T_j + λ the signal details are preserved according to their magnitudes, while in the transition region T_j < |W_{j,k}| < T_j + λ the exponential function offsets the disadvantages of both hard and soft thresholding and makes the threshold curve much smoother. The result is a flexible rule with good stability and adaptability, and therefore excellent de-noising performance.
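A sketch of a thresholding rule with the shape of (15) as reconstructed above: zero below T_j, a fixed hard-like shrinkage well above T_j + λ, and a smooth exponential transition in between. The exact form of the transition branch follows our reading of (15), and μ and λ are tuning parameters with illustrative defaults.

```python
import numpy as np

def adaptive_threshold(W, T, lam, mu=0.5):
    """Smooth threshold rule of the shape given in (15), as reconstructed here.

    Coefficients below T are zeroed, coefficients above T + lam are shrunk
    by a fixed amount mu*lam (hard-like), and the region in between uses a
    sigmoid (exponential) transition so the curve stays smooth.
    """
    W = np.asarray(W, dtype=float)
    absW = np.abs(W)
    hard_like = np.sign(W) * (absW - mu * lam)
    smooth = W + 2.0 * mu * lam / (1.0 + np.exp(2.0 * W / lam)) - mu * lam
    out = np.where(absW >= T + lam, hard_like, smooth)
    out[absW <= T] = 0.0
    return out
```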

IV. EXPERIMENTAL RESULTS

This section presents the performance evaluation of the proposed enhancement algorithm, together with comparisons against conventional algorithms under different input SNRs. In the experiment, the clean utterance sequences are chosen from a pure speech database. They are corrupted by white Gaussian noise whose power is adjusted to give input SNRs in the range of -5 to 10 dB. The speech is sampled at 1.125 kHz and quantized to 16 bits. Based on this, the following parameters are chosen: 1) frame size N = 240 with 50% overlap; 2) the signal is decomposed into four levels using the Symlet8 wavelet. We use the SNR and the Mean Square Error (MSE), given by (16), to evaluate the performance of each algorithm:

MSE = (1/M)·Σ_{i=1}^{M} (f(i) - f̂(i))²    (16)

where f(n) represents the original speech signal and f̂(n) the reconstructed one. The whole experiment is performed on the MATLAB 7.6 platform and consists of three parts, which demonstrate the feasibility and superiority of our method in wavelet de-noising.
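A sketch of how the test material can be generated and scored: white Gaussian noise scaled to a target input SNR, and output SNR and MSE (16) computed against the clean reference. The power-ratio noise-scaling convention is assumed; the paper does not state it explicitly.

```python
import numpy as np

def add_noise(clean, snr_db_target, rng=None):
    """Scale white Gaussian noise so the mixture has the requested input SNR."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(len(clean))
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db_target / 10.0)))
    return clean + noise

def snr_db(clean, estimate):
    """Output SNR in dB between the clean signal and its reconstruction."""
    err = clean - estimate
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(err ** 2))

def mse(clean, estimate):
    """Mean square error, Eq. (16)."""
    return np.mean((clean - estimate) ** 2)
```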

A. Classification of Voiced/Unvoiced Signal

This test is intended to show that a speech signal often contains both voiced and unvoiced segments and that it is important to remove part of the noise before applying the short-time energy criterion to classify them. In the test, the clean speech signal "seven" comes from the pure speech database.

Its voiced and unvoiced segments are shown in Figure 1. From the figure we can see that, in the word "seven", the pronunciation of "s" is an unvoiced signal and that of "v" is a voiced signal.

Figure 1. Classification of voiced/unvoiced signal

In order to test the influence of noise on the performance of the short-time energy criterion for classifying voiced/unvoiced signal, we compared the unvoiced waveform of the same word "seven" in three cases: the clean speech, the noise-corrupted speech and the de-noised speech; the result is shown in Figure 2. In the figure, "Unvoiced signal 1" is the unvoiced part extracted from the clean speech. When white Gaussian noise is added, the unvoiced part extracted from the signal is given by "Unvoiced signal 2", in which case mistakes occur. After part of the noise is removed with the spectral subtraction algorithm, the extracted unvoiced part is close to the first case again. From the test we conclude that noise indeed affects the performance of the short-time energy criterion in classifying voiced/unvoiced signal.

Figure 2. Comparison of the three conditions under which the unvoiced signal is extracted

B. Performance Evaluation of Different Wavelet De-noising Approaches

This test is meant to assess the performance of different wavelet de-noising algorithms. In the test, we experimented on two speech signals: one from the pure speech database and the other a woman's voice recorded in a quiet environment. For comparison between the methods, the reconstructed speech waveforms are displayed in Figures 3 and 4. In the test we use the soft thresholding algorithm, and the input SNRs for the two cases are 5 dB and 0 dB, respectively. These waveforms clearly show that the proposed wavelet de-noising method, based on the classification of voiced/unvoiced signals and dynamic thresholds, achieves better performance in a real system.

Figure 3. Performance of the conventional and the proposed wavelet de-noising on the speech database (5 dB input SNR)

Figure 4. Performance of the conventional and the proposed wavelet de-noising on speech samples of our own recording (0 dB input SNR)

To evaluate the performance of the proposed algorithm more precisely, the SNR and MSE were calculated under different input SNRs; the results are shown in TABLE I. Since the absolutely clean speech is assumed to be unavailable, the power of the clean speech is estimated by computing the power of some silent frames and subtracting it from the power of the active frames. From the table we can clearly see that the proposed method greatly improves the SNR with less distortion of the original speech.

TABLE I. COMPARISON OF SNR AND MSE IN DIFFERENT WAVELET DENOISING METHODS

                      Universal Threshold          Dynamic Threshold
SNR Input/dB       SNR Output/dB      MSE       SNR Output/dB      MSE
     -5                12.70         0.5301         13.51         0.5088
      0                19.27         0.3808         20.71         0.3550
      5                23.77         0.3046         24.70         0.2908
     10                25.84         0.2747         26.32         0.2682
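The SNR values in the table are computed without access to the absolutely clean speech: the noise power is taken from silent frames and subtracted from the power of the active frames. A sketch of that estimate follows; the lowest-energy-decile rule used to pick the silent frames is an illustrative stand-in for whatever silence detector was actually used.

```python
import numpy as np

def estimate_snr_from_silence(x, frame_len=240, silence_quantile=0.1):
    """Estimate SNR when no clean reference is available.

    Frames whose energy falls in the lowest decile are treated as silent;
    their average power serves as the noise estimate, which is subtracted
    from the average power of the remaining (active) frames.
    """
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    powers = np.mean(frames ** 2, axis=1)
    cut = np.quantile(powers, silence_quantile)
    noise_power = np.mean(powers[powers <= cut])
    active_power = np.mean(powers[powers > cut])
    speech_power = max(active_power - noise_power, 1e-12)
    return 10.0 * np.log10(speech_power / noise_power)
```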

C. Performance of Four Thresholding Algorithms

To evaluate the de-noising performance of the four thresholding methods discussed in this paper (hard, soft, semi-soft [10] and the proposed one), we ran four further tests on another speech signal, measuring the SNR and MSE obtained with each method under different input SNRs (-5 dB, 0 dB, 5 dB and 10 dB). The experimental results are shown in TABLE II and TABLE III.

TABLE II. SNR OUTPUT OF FOUR THRESHOLDING METHODS

Method      SNR Input -5 dB     0 dB      5 dB     10 dB
Hard              9.52         18.90     24.86     27.52
Soft             11.24         20.44     24.70     26.32
Semi-soft        12.68         20.71     25.38     28.82
New              13.51         21.23     26.13     29.41

TABLE III. MSE OUTPUT OF FOUR THRESHOLDING METHODS

Method      SNR Input -5 dB     0 dB      5 dB     10 dB
Hard             0.6211        0.3886    0.2886    0.2526
Soft             0.5702        0.3600    0.2908    0.2682
Semi-soft        0.5231        0.3550    0.2819    0.2453
New              0.5088        0.3465    0.2708    0.2367

From the two tables we can see that when the input SNR is low, soft thresholding performs better than hard thresholding; as the input SNR increases, this tendency is gradually reversed. Semi-soft thresholding performs somewhat better than both hard and soft thresholding. The new thresholding algorithm proposed in this paper performs best under all input SNR conditions.

V. CONCLUSION

In this paper, an adaptive speech enhancement algorithm based on the classification of voiced and unvoiced signal is proposed. By applying dynamic thresholds at different wavelet analysis scales and using an adaptive thresholding function to shrink the wavelet coefficients of the noise, the proposed algorithm achieves much better performance in reducing noise while preserving the intelligibility of the original speech. Our future work will include: 1) searching for new methods to classify voiced and unvoiced signal in speech; 2) testing different wavelet basis functions and decomposition depths (in our experiments the signal was decomposed into only four levels with the Symlet8 basis) to determine the best enhancement performance; and 3) for both WT and WPT, finding better thresholding criteria that protect the useful information in the speech while removing as much noise as possible.

ACKNOWLEDGMENT

R. B. G. thanks Prof. Zhang for his valuable comments on this paper, and also thanks Chen Yuan-yuan, who supported part of this work.

REFERENCES
[1] L. Akter and M. K. Hasan, "Crosscorrelation compensated Wiener filter for speech enhancement," in Proc. Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France, 2006.
[2] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-27, pp. 113-120, Apr. 1979.
[3] K. Hermus, P. Wambacq and H. Van Hamme, "A review of signal subspace speech enhancement and its application to noise robust speech recognition," EURASIP Journal on Advances in Signal Processing, vol. 2007, 15 pages, 2007.
[4] D. L. Donoho, "De-noising by soft-thresholding," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 613-627, May 1995.
[5] Mohammad A. Akhaee, Ali Ameri and Farokh A. Marvasti, "Speech enhancement by adaptive noise cancellation in the wavelet domain," in Proc. Int. Conf. Information, Communications and Signal Processing, pp. 719-723, 2005.
[6] Y. Shao and C. H. Chang, "A versatile speech enhancement system based on perceptual wavelet denoising," in Proc. IEEE International Symposium on Circuits and Systems, vol. 2, pp. 864-867, 2005.
[7] Tai-Chiu Hsung and Daniel Pak-Kong Lun, "Speech enhancement based on adaptive wavelet denoising on multitaper spectrum," in Proc. IEEE International Symposium on Circuits and Systems, pp. 1700-1703, May 2008.
[8] Lan Xu and Hon Keung Kwan, "Adaptive wavelet denoising system for speech enhancement," in Proc. IEEE International Symposium on Circuits and Systems, pp. 3210-3213, May 2008.
[9] Nathalie Virag, "Single channel speech enhancement based on masking properties of the human auditory system," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 2, pp. 126-137, Mar. 1999.
[10] S. Ayat, M. T. Manzuri and R. Dianat, "Wavelet based speech enhancement using a new thresholding algorithm," in Proc. International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 238-241, Oct. 2004.

