CHAPTER 2
THEORETICAL STUDY
2.1 Speech Enhancement
Speech quality degrades severely in the presence of acoustic noise, and the
degradation depends largely on the characteristics of the noise and the environment.
Speech enhancement algorithms improve the quality and intelligibility of speech by
reducing or eliminating the noise component of the speech signal. Improving the
quality and intelligibility of speech signals reduces listener fatigue and improves the
performance of hearing aids, cockpit communication, videoconferencing, speech
coders and many other speech processing systems.
Speech enhancement plays a very important role in today's world. It is used in a
number of applications to reduce the noise around us, whether from cars, trains,
planes or simply crowds; everywhere we go there is noise. Speech enhancement
techniques therefore benefit a wide range of applications such as mobile phones,
hands-free phones and speech recognition services. Many different techniques exist
to enhance the quality of speech for the listener.
Fig. 2.1: General speech enhancement
2.2 Classification of Speech Enhancement Techniques
Speech enhancement systems can be classified in a number of ways based on
the criteria used or application of the enhancement system.
Table: Speech enhancement processing strategies

Domain                      Possible Strategies
Number of input channels    One/Two/Multiple
Domain of processing        Time/Frequency
Type of algorithm           Non-adaptive/Adaptive
Additional constraints      Speech Production/Perception
2.3 Time domain versus Transform domain
Speech enhancement can be performed in both the time and frequency domains.
Time-domain techniques include those based on Finite Impulse Response (FIR) and
Infinite Impulse Response (IIR) filters, Linear Predictive Coefficients (LPC),
Kalman filtering, Hidden Markov Models, etc. Transform-domain techniques first
apply a transformation to the noisy speech before filtering, followed by the
corresponding inverse transformation to restore the original speech.
2.4 Methods for Speech Enhancement
The algorithms of speech enhancement for noise reduction can be categorized
into three fundamental classes: filtering techniques, spectral restoration, and model
based methods.
 Spectral Subtraction Method
 Wiener Filtering
 Signal subspace approach (SSA)
2.4.1 Spectral Subtraction
The spectral subtractive algorithm is historically one of the first algorithms
proposed for noise reduction. It is simple and easy to implement, and is based on the
principle that the noise spectrum can be estimated, and updated, from the periods
when the speech signal is absent and only noise is present. The assumption is that
the noise is a stationary or slowly varying process, so that its spectrum does not
change significantly between the update periods. To restore the time-domain signal,
an estimate of the instantaneous magnitude spectrum is combined with the phase of
the noisy signal and transformed back to the time domain via an inverse discrete
Fourier transform. In terms of computational complexity, spectral subtraction is
relatively inexpensive. However, owing to random variations of the noise, spectral
subtraction can result in negative estimates of the short-time magnitude or power
spectrum. The magnitude and power spectra are non-negative quantities, and any
negative estimates must be mapped to non-negative values. This nonlinear
rectification process distorts the distribution of the restored signal, and the
processing distortion becomes more noticeable as the signal-to-noise ratio decreases.
Let y(n) be the noise-corrupted input speech signal, composed of the clean
speech signal x(n) and the additive noise signal d(n). In equation form, y(n) can be
written in the time domain and the Fourier domain as given in equations (1) and (2)
respectively:

y(n) = x(n) + d(n)                                   (1)

Y(w) = X(w) + D(w)                                   (2)
Y(w) can be expressed in terms of magnitude and phase as

Y(w) = |Y(w)| e^(j φ_y(w))                           (3)

where |Y(w)| is the magnitude spectrum and φ_y(w) is the phase spectrum of the
corrupted noisy speech signal. Similarly, the noise spectrum in terms of its
magnitude and phase spectra is

D(w) = |D(w)| e^(j φ_d(w))                           (4)

The magnitude of the noise spectrum |D(w)| is unknown but can be replaced by
its average value computed during non-speech activity, i.e. during speech pauses.
The noise phase φ_d(w) is replaced by the noisy speech phase φ_y(w), which does
not noticeably affect speech intelligibility.
Figure 2.2: Block diagram of the spectral subtraction enhancement method.
Thus one can estimate the clean speech spectrum simply by subtracting the noise
spectrum from the noisy speech spectrum; in equation form,

X̂(w) = [ |Y(w)| − |D(w)| ] e^(j φ_y(w))              (5)

where X̂(w) is the estimated clean speech spectrum. Many spectral subtractive
algorithms exist, depending on the quantity being subtracted, such as magnitude
(amplitude) spectral subtraction, power spectral subtraction and autocorrelation
subtraction. The estimate of the clean speech magnitude spectrum is

|X̂(w)| = |Y(w)| − |D(w)|                             (6)

and similarly, for power spectral subtraction,

|X̂(w)|² = |Y(w)|² − |D(w)|²                          (7)

The enhanced speech signal is finally obtained by computing the inverse Fourier
transform of the estimated clean speech, using |X̂(w)| for magnitude spectral
subtraction or |X̂(w)|² for power spectral subtraction, together with the phase of the
noisy speech signal. The more general version of the spectral subtraction algorithm
is

|X̂(w)|^p = |Y(w)|^p − |D̂(w)|^p                       (8)

where p is the power exponent; p = 1 yields the magnitude spectral subtraction
algorithm and p = 2 the power spectral subtraction algorithm. The spectral
subtraction algorithm is computationally simple, as it involves only a forward and
an inverse Fourier transform.
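As a concrete illustration, equations (5)-(8) can be sketched in a few lines of NumPy. This is a minimal single-frame sketch, not the implementation used in this work; the frame length, the synthetic signals and the way the noise magnitude is estimated are assumptions made only for the example:

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag_est, p=1.0):
    """Generalized spectral subtraction on one frame (Eq. 8).

    noisy         : time-domain noisy frame y(n)
    noise_mag_est : estimated noise magnitude spectrum |D(w)|
    p             : power exponent (p=1 magnitude, p=2 power subtraction)
    """
    Y = np.fft.rfft(noisy)
    mag, phase = np.abs(Y), np.angle(Y)
    # Subtract in the |.|^p domain; clamp negatives (half-wave rectification).
    clean_p = np.maximum(mag**p - noise_mag_est**p, 0.0)
    X_hat = clean_p**(1.0 / p) * np.exp(1j * phase)  # reuse the noisy phase
    return np.fft.irfft(X_hat, n=len(noisy))

# Toy example: a sinusoid plus white noise, with the noise magnitude
# estimated from a separate noise-only frame (a "speech pause").
rng = np.random.default_rng(0)
n = np.arange(256)
clean = np.sin(2 * np.pi * 0.05 * n)
noise = 0.3 * rng.standard_normal(256)
noise_mag = np.abs(np.fft.rfft(0.3 * rng.standard_normal(256)))
enhanced = spectral_subtraction(clean + noise, noise_mag, p=2.0)
```

Because the noise estimate comes from a different noise realization, the per-bin subtraction errors never cancel exactly; this residual is precisely the musical noise discussed below.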
2.4.1.3 Drawbacks of the Spectral Subtraction Method
While the spectral subtraction method is easily implemented and effectively
reduces the noise present in the corrupted signal, it has some glaring
shortcomings, which are given below:

• Residual noise (musical noise)
The effectiveness of the noise removal process clearly depends on obtaining
an accurate spectral estimate of the noise signal: the better the noise estimate, the
smaller the residual noise content in the modified spectrum. However, since the
noise spectrum cannot be obtained directly, an averaged estimate of the noise is
used. Hence there are significant variations between the estimated noise spectrum
and the actual noise content present in the instantaneous speech spectrum. The
subtraction of these quantities results in isolated residual noise components of large
variance. This residual spectral content manifests itself in the reconstructed time
signal as varying tonal sounds, resulting in a musical disturbance of an unnatural
quality. This musical noise can be even more disturbing and annoying to the listener
than the distortions due to the original noise.

Several residual noise reduction algorithms have been proposed to combat this
problem. However, due to the limitations of single-channel enhancement methods,
it is not possible to remove this noise completely without compromising the quality
of the enhanced speech. Hence there is a trade-off between the amount of noise
reduction and the speech distortion caused by the underlying processing.
• Distortions due to half/full-wave rectification

The modified speech spectrum obtained from Eq. (7) may contain negative
values due to errors in the estimated noise spectrum. These values are rectified
using half-wave rectification (set to zero) or full-wave rectification (set to the
absolute value), which can lead to further distortions in the resulting time signal.
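The two rectification options differ only in how they map negative bins; a toy NumPy sketch with made-up spectral difference values:

```python
import numpy as np

# A difference spectrum |Y|^2 - |D_est|^2 may go negative where the
# noise estimate overshoots the actual noise in that bin.
diff = np.array([4.0, -1.5, 2.0, -0.3])

half_wave = np.maximum(diff, 0.0)   # negative bins set to zero
full_wave = np.abs(diff)            # negative bins flipped to their absolute value
```

Half-wave rectification discards the information in negative bins entirely, while full-wave rectification folds the estimation error back into the spectrum; both distort the distribution of the restored signal.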
• Roughening of speech due to the noisy phase

The phase of the noise-corrupted signal is not enhanced before being
combined with the modified magnitude spectrum to regenerate the enhanced time
signal, because the presence of noise in the phase information does not contribute
greatly to the degradation of speech quality. This is especially true at high
SNRs (> 5 dB). At lower SNRs (< 0 dB), however, the noisy phase can lead to a
perceivable roughness in the speech signal, contributing to a reduction in speech
quality. Estimating the phase of the clean speech is difficult and would greatly
increase the complexity of the method. Moreover, the distortion due to the noisy
phase is not very significant compared to that of the magnitude spectrum,
especially at high SNRs. Hence the use of the noisy phase is considered acceptable
practice in the reconstruction of the enhanced speech signal. Most speech
enhancement algorithms, including the spectral subtraction methods, try to
optimize noise removal based on mathematical models of the speech and noise
signals.

However, speech is a subtle form of communication and depends heavily on the
relationship of one frequency with another. Hence, while conventional speech
enhancement algorithms can increase the quality of noisy speech by increasing the
SNR, they yield no significant increase in speech intelligibility. Algorithms should
take into account the subtleties of speech and incorporate methods based on the
perceptual properties of the speech signal. The spectral subtraction methods, as
well as most other methods, suffer from this drawback.
• Modifications to spectral subtraction

Several variants of the spectral subtraction method originally proposed by Boll
have been developed to address the problems of the basic technique, especially the
presence of musical noise. Other methods based on this approach perform noise
suppression in the autocorrelation, cepstral, logarithmic and subspace domains. A
variety of pre- and post-processing methods have also proved helpful in reducing
musical noise while minimizing speech distortion. This section deals with the
different techniques and enhancements that have been proposed over the years.
• Magnitude averaging

Magnitude averaging of the input spectrum reduces spectral error by averaging
across neighboring frames. This lowers the noise variance while reinforcing the
speech spectral content, thus preventing destructive subtraction. Magnitude
averaging is viable only for stationary waveforms; due to the short-term
stationarity of speech, the number of neighboring frames over which the averaging
is done is limited. If this constraint is ignored, a certain slurring of the speech can
be detected due to the smearing of different phonemes into each other. A
generalized representation of the averaging operation can be expressed as:
generalized representation of the averaging operation can be expressed as:
=
+
=
M
M j
j i j i
k Y W
M
k Y ) (
1 2
1
) (
(9)
where i is the frame index. The weights
=1 ( j), the equation reduces to the basic magnitude averaging operation. In the
case where the frames are weighted by different values for
, the operation is
referred to as weighted magnitude averaging.
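The averaging of Eq. (9) is a one-liner over a stack of frame magnitudes; a small sketch, where the frame count, bin count and window half-width M are arbitrary example values:

```python
import numpy as np

def magnitude_average(frames_mag, i, M, weights=None):
    """Weighted magnitude averaging over 2M+1 neighbouring frames (Eq. 9).

    frames_mag : array of shape (num_frames, num_bins), |Y_i(k)| per frame
    i          : centre frame index (must have M frames on each side)
    weights    : W_j values; all ones gives basic magnitude averaging
    """
    if weights is None:
        weights = np.ones(2 * M + 1)
    neighbourhood = frames_mag[i - M:i + M + 1]
    return (weights[:, None] * neighbourhood).sum(axis=0) / (2 * M + 1)

# Five frames of four bins each; average around the middle frame with M = 1.
mags = np.arange(20, dtype=float).reshape(5, 4)
avg = magnitude_average(mags, i=2, M=1)
```

Increasing M lowers the noise variance further but, as noted above, smears neighbouring phonemes into each other.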
• Generalized spectral subtraction

A generalized form of the basic spectral subtraction is given as:

|Ŝ(k)|^a = |Y(k)|^a − |D̂(k)|^a                                    (10)

where the power exponent a can be chosen to obtain optimum performance. When
a = 2, the subtraction is carried out on the short-term power density spectra
(STPDS) and is referred to as power spectral subtraction. When a = 1, the equation
reduces to the basic spectral subtraction method, where the subtraction is carried
out on the magnitude spectra.
• Spectral subtraction using oversubtraction and a spectral floor

An important variation of spectral subtraction was proposed by Berouti et al. for
the reduction of residual musical noise. The technique can be expressed as:

|Ŝ(k)|² = |Y(k)|² − α |D̂(k)|²     if |Y(k)|² − α |D̂(k)|² > β |D̂(k)|²
        = β |D̂(k)|²               otherwise                        (11)

where the oversubtraction factor α is a function of the segmental signal-to-noise
ratio, calculated as:

α = α₀ − (3 / 20) SNR,     −5 dB ≤ SNR ≤ 20 dB                     (12)
where α₀ is the desired value of α at 0 dB SNR. The oversubtraction factor α
subtracts an overestimate of the noise power spectrum from the speech power
spectrum. This operation minimizes the presence of residual noise by decreasing
the spectral excursions in the enhanced spectrum. The oversubtraction factor can
be seen as a time-varying factor which provides a degree of control over the noise
removal process between periods of noise update.

The spectral floor parameter β prevents the spectral components of the enhanced
spectrum from falling below the lower value β |D̂(k)|². This operation fills in the
valleys between spectral peaks, and the reinsertion of broadband noise into the
spectrum helps to mask the neighboring residual noise components. While this
technique has proved successful in suppressing the residual noise to a large extent,
oversubtraction of the noise estimate also causes heavy speech distortions.
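The oversubtraction rule and the SNR-dependent factor of Eqs. (11)-(12) can be sketched together; the default values of α₀ and β below are illustrative assumptions, not values prescribed by this chapter:

```python
import numpy as np

def berouti_subtract(Y_pow, D_pow, snr_db, alpha0=4.0, beta=0.01):
    """Oversubtraction with a spectral floor (after Berouti et al.), on power spectra.

    alpha0 : desired oversubtraction factor at 0 dB SNR (assumed example value)
    beta   : spectral floor parameter (assumed example value)
    """
    # alpha = alpha0 - 3*SNR/20, defined for -5 dB <= SNR <= 20 dB.
    snr_db = np.clip(snr_db, -5.0, 20.0)
    alpha = alpha0 - 3.0 * snr_db / 20.0
    S_pow = Y_pow - alpha * D_pow          # oversubtract the noise estimate
    floor = beta * D_pow                   # never fall below the spectral floor
    return np.where(S_pow > floor, S_pow, floor)

# Three bins at 0 dB segmental SNR, so alpha = alpha0 = 4: the middle bin
# would go negative and is clamped to the floor instead.
Y_pow = np.array([10.0, 2.0, 6.0])
D_pow = np.array([1.0, 1.0, 1.0])
S = berouti_subtract(Y_pow, D_pow, snr_db=0.0)
```

Raising β trades residual musical noise for a higher broadband noise floor, which is exactly the masking effect described above.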
Nonlinear Spectral Subtraction (NSS)

NSS is a modification of the oversubtraction method in which the oversubtraction
factor is made frequency dependent and the subtraction process nonlinear. NSS
assumes that noise does not affect all spectral components equally: certain types of
noise may affect the low-frequency region of the spectrum more than the
high-frequency region. This suggests the use of a frequency-dependent subtraction
factor for different types of noise, which makes the subtraction process nonlinear.
Larger values are subtracted at frequencies with low SNR levels, and smaller
values are subtracted at frequencies with high SNR levels. The subtraction rule
used in the NSS algorithm has the following form:

|X̂(e^jw)| = |Y(e^jw)| − α(w) N̄(e^jw)     if |Y(e^jw)| > α(w) N̄(e^jw) + β |D(e^jw)|
          = β |Y(e^jw)|                   otherwise                (14)

where β is the spectral floor parameter, α(w) is a frequency-dependent subtraction
factor, and N̄(e^jw) is the maximum of the noise magnitude estimates |D(e^jw)|
taken over recent frames.
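The NSS rule of Eq. (14) subtracts only where the noisy magnitude clearly exceeds the scaled noise maximum; a sketch under the reconstruction above, with the history length, α(w) values and β chosen purely for illustration:

```python
import numpy as np

def nss_subtract(Y_mag, D_mag_history, alpha_w, beta=0.02):
    """Nonlinear spectral subtraction sketch (Eq. 14).

    Y_mag         : |Y(e^jw)| for the current frame
    D_mag_history : past noise magnitude estimates, shape (frames, bins)
    alpha_w       : frequency-dependent subtraction factor, one value per bin
    beta          : spectral floor parameter (assumed example value)
    """
    N_max = D_mag_history.max(axis=0)              # max of recent noise estimates
    D_cur = D_mag_history[-1]
    keep = Y_mag > alpha_w * N_max + beta * D_cur  # subtract only where signal dominates
    return np.where(keep, Y_mag - alpha_w * N_max, beta * Y_mag)

# Two bins: the first is signal-dominated and gets subtracted; the second
# fails the test and is floored to beta*|Y|.
X = nss_subtract(np.array([5.0, 1.0]),
                 np.array([[1.0, 0.5], [0.8, 0.6]]),
                 np.array([1.0, 2.0]))
```

Making alpha_w larger in low-SNR bands and smaller in high-SNR bands reproduces the frequency-dependent behaviour described in the text.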
2.4.2 Wiener Filtering

The Wiener filter rule is derived from optimal filter theory. It is assumed that the
noisy signal consists of the clean speech s(t) plus uncorrelated additive noise n(t),
and that the noise follows a normal distribution. The impulse response h(t) of the
filter is derived in the minimum mean squared error (MMSE) sense by minimizing
E{ [s(t) − ŝ(t)]² }, where ŝ(t) is the estimate of the clean speech obtained from the
noisy speech y(t) and E{·} denotes the expectation operator. Assuming that both
s(t) and n(t) are short-time stationary stochastic processes, the essence is to solve
the Wiener-Hopf equation:

R_sy(τ) = ∫ h(a) R_yy(τ − a) da                                    (15)

Taking the Fourier transform of both sides of Eq. (15) gives

P_sy(w) = H(w) P_yy(w)                                             (16)

Since the signals s(t) and n(t) are independent of each other,

P_yy(w) = P_s(w) + P_n(w)                                          (17)

Substituting Eq. (17) into Eq. (16), we obtain the frequency-domain solution of
the Wiener-Hopf equation:

H(w) = P_s(w) / (P_s(w) + P_n(w))                                  (18)

Eq. (18) is the frequency response of the conventional Wiener filter, where P_s(w)
and P_n(w) are the power spectral densities (PSDs) of the speech and the noise
respectively. P_n(w) can be estimated from the noisy speech during regions of
silence using a noise estimation method, and P_s(w) can be obtained from the input
signal using the spectral subtraction method. The final estimate of the speech
power spectrum is then obtained by applying the Wiener filter to the power
spectrum of the noisy speech.
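The Wiener gain of Eq. (18) is a simple per-bin ratio; a minimal sketch, where the PSD values are made-up examples and the PSD estimation step itself (noise estimation plus spectral subtraction, as described above) is assumed to happen elsewhere:

```python
import numpy as np

def wiener_gain(P_s, P_n):
    """Frequency response of the Wiener filter, Eq. (18): H = P_s / (P_s + P_n)."""
    return P_s / (P_s + P_n)

def wiener_enhance(noisy, P_s, P_n):
    """Apply the Wiener gain to the spectrum of one noisy frame (sketch only;
    P_s and P_n must match the rfft bin count of the frame)."""
    Y = np.fft.rfft(noisy)
    return np.fft.irfft(wiener_gain(P_s, P_n) * Y, n=len(noisy))

# Where speech dominates the gain approaches 1; where noise dominates it
# approaches 0, attenuating that bin.
H = wiener_gain(np.array([9.0, 1.0]), np.array([1.0, 9.0]))
```

Unlike spectral subtraction, the gain is always in [0, 1], so no rectification step is needed and no negative spectral estimates can occur.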
2.4.3 Signal Subspace Method

This method has been used frequently in digital signal processing, e.g. in spectrum
estimation and system identification. The key assumption in signal subspace
methods for speech enhancement is that the correlation matrices of vectors of the
clean speech signal are not positive definite, i.e. some of their eigenvalues are
practically zero. This can be observed either by examining the empirical
correlation matrices of the speech signal or by studying the commonly used linear
model for that signal. It indicates that the energy of the clean signal vector is
distributed among a subset of its coordinates, so the signal is confined to a
subspace.

If the correlation matrix of the additive noise is assumed positive definite, i.e. all
noise eigenvalues are strictly positive, then the noise vectors span the entire space.
Thus the noise components in the subspace complementary to the signal subspace
can be removed without degrading the clean speech signal. The decomposition of
the vector space of the noisy signal can be performed by applying the
eigendecomposition to the correlation matrix. In practice, however, the
second-order statistics are estimated from a number of noisy vectors, so a better
approach is to organize the vectors in a data matrix and use the singular value
decomposition.
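The SVD-based route described above can be sketched as a low-rank projection of the data matrix; the signal rank, matrix sizes and noise level below are assumptions for the example, and a real enhancer would estimate the rank and apply per-component gains rather than a hard truncation:

```python
import numpy as np

def subspace_denoise(data_matrix, rank):
    """Project noisy signal vectors onto the leading `rank`-dimensional subspace.

    data_matrix : noisy signal vectors stacked as rows
    rank        : assumed dimension of the clean-signal subspace
    """
    U, s, Vt = np.linalg.svd(data_matrix, full_matrices=False)
    s[rank:] = 0.0                      # discard the noise-dominated directions
    return (U * s) @ Vt                 # rebuild from the retained components

# Rank-1 clean data plus small noise: truncating to rank 1 removes the noise
# that lives in the complementary subspace.
rng = np.random.default_rng(1)
clean = np.outer(rng.standard_normal(20), rng.standard_normal(8))
noisy = clean + 0.05 * rng.standard_normal((20, 8))
denoised = subspace_denoise(noisy, rank=1)
```

Only the noise component lying inside the retained signal subspace survives the projection, which is why the method cannot remove the noise completely.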
Linear Model:

A speech signal is non-stationary, but within a short-time window it can be
considered a wide-sense stationary stochastic process. The speech signal can
therefore be represented by a linear stochastic model of the form

s = Hθ = Σ_{i = 1}^{p} θ_i h_i

where s = (s₁, s₂, ..., s_m)^T is a vector of signal samples, H ∈ R^{m×p} is a
model matrix with columns h_i, and θ = (θ₁, θ₂, ..., θ_p)^T is a zero-mean random
coefficient vector drawn from a multivariate distribution.
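The linear model directly implies the rank deficiency that the subspace method exploits: with p < m, the correlation matrix of s has at most p nonzero eigenvalues. A short numerical check, with m, p and the sample count chosen arbitrarily for the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
m, p = 8, 3
H = rng.standard_normal((m, p))            # model matrix in R^{m x p}
theta = rng.standard_normal((p, 1000))     # zero-mean coefficient vectors
s = H @ theta                              # clean signal vectors, shape (m, 1000)

R = (s @ s.T) / s.shape[1]                 # empirical correlation matrix of s
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
# Only the first p eigenvalues are significantly nonzero; the remaining
# m - p are practically zero, confining the signal to a p-dimensional subspace.
```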