IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 2, FEBRUARY 2010

Speech Enhancement Using Harmonic Emphasis and Adaptive Comb Filtering

Wen Jin, Xin Liu, Michael S. Scordilis, Senior Member, IEEE, and Lu Han

Abstract: An enhancement method for single-channel speech degraded by additive noise is proposed. A spectral weighting function is derived by constrained optimization to suppress noise in the frequency domain. Two design parameters are included in the suppression gain, namely, the frequency-dependent noise-flooring parameter (FDNFP) and the gain factor. The FDNFP controls the level of admissible residual noise in the enhanced speech. Enhanced harmonic structures are incorporated into the FDNFP by time-domain processing of the linear prediction residuals of voiced speech. Further enhancement of the harmonics is achieved by adaptive comb filtering derived from the gain factor with a peak-picking algorithm. The performance of the enhancement method was evaluated by the modified bark spectral distortion (MBSD), ITU Perceptual Evaluation of Speech Quality (PESQ) scores, composite objective measures, and listening tests. Experimental results indicate that the proposed method outperforms spectral subtraction, a signal subspace method applicable to both white and colored noise conditions, and a perceptually based enhancement method with a constant noise-flooring parameter, particularly at lower signal-to-noise ratio conditions. Our listening test indicated that 16 listeners on average preferred the proposed approach over any of the other three approaches about 73% of the time.

Index Terms: Constrained optimization, harmonic enhancement, speech enhancement.

I. INTRODUCTION

The enhancement of single-channel speech degraded by additive noise has been extensively studied in the past and remains a challenging problem because only the noisy speech is available. Techniques have been proposed in the literature to exploit the harmonic structure of voiced speech for enhancing the speech quality [1]–[12]. In the work of [1] and [2], voiced speech is modeled as harmonic components plus noise-like components, and enhancement is performed by estimating the harmonic components while reducing the additive noise in the noise-like components.

Manuscript received June 24, 2008; revised July 02, 2009. First published
July 31, 2009; current version published November 20, 2009. The associate editor coordinating the review of this manuscript and approving it for publication
was Prof. Yariv Ephraim.
W. Jin is with Qualcomm, San Diego, CA 92121 USA (e-mail: wjin@qualcomm.com).
X. Liu and M. S. Scordilis are with the Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33146-0640 USA
(e-mail: x.liu6@umiami.edu; m.scordilis@miami.edu).
L. Han is with the Department of Electrical and Computer Engineering, North
Carolina State University, Raleigh, NC 27695 USA (e-mail: lhan2@ncsu.edu).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2009.2028916

[3] extends the conventional hidden Markov model (HMM)-based minimum mean square error (MMSE) estimator to enhance the harmonics of voiced speech; it incorporates a ternary voicing state and applies it to a harmonic representation of voiced speech. The sinusoidal model is adopted in the speech enhancement context in the algorithms of [4]–[8].
The aforementioned algorithms rely on some underlying speech model to enhance the harmonics of voiced speech. An alternative strategy is to process the speech directly, either on the time-domain waveform or on the frequency-domain magnitudes. [9] uses the fundamental frequency to narrow down the a priori probability distribution (PD) of the DFT amplitude, and consequently it improves the estimation of the DFT spectrum and enhances the harmonics of voiced speech. In [10], the spectral subtraction algorithm is used as a general and basic enhancement system [13], a set of metrics is introduced to measure the harmonicity of short-time pre-enhanced speech and to indicate whether further enhancement is necessary, and the quality of voiced speech is improved by post-enhancing the harmonic structures with adaptive comb filtering [14], [15]. Ephraim introduced an MMSE estimator to enhance the speech in [16], [17]; [11] then exploits the correlation between frequency components to improve an MMSE estimator of the short-time complex spectrum. Kalman filtering has been used to produce a time-domain optimal estimate of the clean speech [18]–[20]. In [12], an artificial harmonic signal is synthesized by nonlinear processing of pre-enhanced speech. This artificial signal is then included in a suppression gain that modifies the spectral magnitudes of the noisy speech.
In this paper, we propose a new method that enhances the harmonics of voiced speech without relying on any underlying speech model. The harmonic speech structure obtained through short-time Fourier analysis is enhanced by applying a combination of time- and frequency-domain criteria, which are applicable to white as well as colored additive noise conditions. While similar principles are shared with [21]–[23], our method addresses the voiced harmonics specifically, instead of using general vector subspace theory, typically formulated through Karhunen-Loeve transform (KLT) solutions [21], [22]. In contrast to many state-of-the-art approaches, the proposed algorithm allows an admissible level of residual noise in the enhanced speech. Since in many real-world applications a complete removal of the degrading noise is neither feasible nor desirable, retaining low-level background noise actually yields better perceptual quality [24]. The proposed method improves speech quality by suppressing the noise in the frequency domain with the use of a spectral weighting function. Two design parameters are introduced into the proposed suppression


gain, namely the frequency-dependent noise-flooring parameter (FDNFP) and the gain factor. The FDNFP shapes the residual noise in the frequency domain such that the harmonic structure of clean speech is preserved. To further enhance the harmonics of voiced speech, adaptive comb filtering is performed through the gain factor by picking the harmonic peaks from the noisy speech spectrum. The proposed algorithm therefore extracts and enhances the harmonics by operating in both the time and frequency domains.
This paper is organized as follows. Section II describes the
principles of the proposed enhancement method. Section III
presents the techniques for enhancing the harmonic structures of
voiced speech. In Section IV, the performance of the proposed
method is evaluated. Finally, Section V draws conclusions.

II. PRINCIPLES OF THE PROPOSED ENHANCEMENT METHOD

Suppose we have a single channel of noisy speech degraded by additive noise. The noisy observation can be expressed as

$$\mathbf{y} = \mathbf{x} + \mathbf{d} \tag{1}$$

where $\mathbf{y}$, $\mathbf{x}$, and $\mathbf{d}$ are $N \times 1$ vectors representing the noisy speech, clean speech, and additive noise, respectively, and $N$ is the number of samples in each analysis frame. The additive noise is assumed to be statistically uncorrelated with the clean speech.

Denote by $\mathbf{F}$ the $N \times N$ Fourier transform matrix, and let $(\cdot)^H$ indicate the matrix Hermitian. The $N$-point short-time Fourier transform (STFT) of the noisy speech is then given by

$$\mathbf{Y} = \mathbf{F}\mathbf{y} = \mathbf{X} + \mathbf{D} \tag{2}$$

where $\mathbf{Y}$, $\mathbf{X}$, and $\mathbf{D}$ are the Fourier transforms of the noisy speech, clean speech, and noise, respectively. Our enhancement task is to find a spectral-domain linear estimator $\mathbf{H}$ such that $\hat{\mathbf{X}} = \mathbf{H}\mathbf{Y}$ produces a good approximation to the clean speech spectrum. Ideally, the enhanced signal spectrum $\hat{\mathbf{X}}$ should be identical to the clean speech spectrum $\mathbf{X}$. Many enhancement methods, e.g., [21], [25], have been proposed in the literature to minimize some error norm between the estimated and clean speech spectra. However, in practical systems there always exist residual distortions in the enhanced speech. Moreover, retaining a comfortable level of residual noise in the enhanced speech will actually improve the perceived quality in many situations. For example, in a telephone application, keeping low-level, natural-sounding background noise will give the far-end user a feeling of the near-end atmosphere and avoid the impression of an interrupted transmission [24]. As stated in the previous section, complete removal of the noise is neither feasible nor desirable. Therefore, we design our linear estimator $\mathbf{H}$ in such a way that the enhanced speech spectrum approaches

$$\mathbf{X}_{\Lambda} = \mathbf{X} + \mathbf{\Lambda}\mathbf{D} \tag{3}$$

where $\mathbf{\Lambda}$ is an $N \times N$ diagonal matrix with real-valued diagonal elements $\lambda_k$, and $k = 0, \ldots, N-1$ is the frequency index. The parameters $\lambda_k$ admit a certain level of noise at each frequency band in the enhanced speech. The values of $\lambda_k$ are bounded by $0 \le \lambda_k \le 1$. Because $\lambda_k$ varies with frequency and controls the level of residual noise at each frequency band $k$, we refer to $\lambda_k$ as the frequency-dependent noise-flooring parameter (FDNFP) in this paper. With (3) as our target of approximation, the estimation error is

$$\boldsymbol{\varepsilon} = \mathbf{H}\mathbf{Y} - \mathbf{X}_{\Lambda} = (\mathbf{H} - \mathbf{I})\mathbf{X} + (\mathbf{H} - \mathbf{\Lambda})\mathbf{D} = \boldsymbol{\varepsilon}_x + \boldsymbol{\varepsilon}_d \tag{4}$$

where $\boldsymbol{\varepsilon}_x = (\mathbf{H} - \mathbf{I})\mathbf{X}$ and $\boldsymbol{\varepsilon}_d = (\mathbf{H} - \mathbf{\Lambda})\mathbf{D}$ represent the speech distortion and residual noise, respectively. Let

$$\bar{\varepsilon}_x^2 = \mathrm{tr}\, E\{\boldsymbol{\varepsilon}_x \boldsymbol{\varepsilon}_x^H\} \tag{5}$$

be the energy of the speech distortion, where $E$ denotes the expectation and $\mathrm{tr}$ is the matrix trace. Similarly, let

$$\bar{\varepsilon}_{d,k}^2 = E\{\boldsymbol{\varepsilon}_d^H \mathbf{P}_k \boldsymbol{\varepsilon}_d\} \tag{6}$$

denote the energy of residual noise in the $k$th frequency band, where $\mathbf{P}_k$ is the $k$th spectral component selector, a diagonal matrix whose only nonzero element is a one in the $k$th diagonal position. We formulate the speech enhancement task as the following constrained optimization problem, originally proposed in [21]:

$$\min_{\mathbf{H}}\ \bar{\varepsilon}_x^2 \tag{7}$$

subject to

$$\bar{\varepsilon}_{d,k}^2 \le \alpha_k, \qquad k = 0, \ldots, N-1 \tag{8}$$

where $\alpha_k$ is the threshold used to suppress noise at the $k$th spectral component. The estimator that satisfies (7) and (8) can be found by following an optimization procedure similar to that used in [21]. Specifically, $\mathbf{H}$ is a stationary feasible point if it satisfies the gradient equation of the Lagrangian

$$\nabla_{\mathbf{H}}\, L(\mathbf{H}, \mu_k) = \nabla_{\mathbf{H}} \left[ \bar{\varepsilon}_x^2 + \sum_{k} \mu_k \left( \bar{\varepsilon}_{d,k}^2 - \alpha_k \right) \right] = 0 \tag{9}$$

and

$$\mu_k \left( \bar{\varepsilon}_{d,k}^2 - \alpha_k \right) = 0, \qquad \mu_k \ge 0 \tag{10}$$

where $\mu_k$ is the Lagrange multiplier for the $k$th spectral component. Assume $\mathbf{\Lambda}$ is real and symmetric. From $\nabla_{\mathbf{H}} L = 0$, we obtain

$$(\mathbf{H} - \mathbf{I})\mathbf{R}_X + \mathbf{M}(\mathbf{H} - \mathbf{\Lambda})\mathbf{R}_D = \mathbf{0} \tag{11}$$

where $\mathbf{R}_X = \mathbf{F}\mathbf{R}_x\mathbf{F}^H$ and $\mathbf{R}_D = \mathbf{F}\mathbf{R}_d\mathbf{F}^H$, $\mathbf{R}_x$ and $\mathbf{R}_d$ are the time-domain autocorrelation matrices of the speech and noise, respectively, and $\mathbf{M}$ is the $N \times N$ diagonal matrix of the Lagrange multipliers. The optimal estimator can be obtained by solving the matrix equation (11). Now let us assume $\mathbf{H}$ is also diagonal. The simplification follows from the fact that the matrices $\mathbf{F}\mathbf{R}_x\mathbf{F}^H$ and $\mathbf{F}\mathbf{R}_d\mathbf{F}^H$ are asymptotically diagonal provided the matrices $\mathbf{R}_x$ and $\mathbf{R}_d$ are Toeplitz [26].


The diagonal elements of $\mathbf{R}_X$ and $\mathbf{R}_D$ are the power spectrum components $P_x(k)$ and $P_d(k)$ of the clean speech and noise, respectively [25]. Therefore, we have the asymptotic diagonal solution

$$h_k = \frac{P_x(k) + \mu_k \lambda_k P_d(k)}{P_x(k) + \mu_k P_d(k)} \tag{12}$$

where $h_k$ are the diagonal elements of $\mathbf{H}$.

We now show how (10) can be satisfied with this diagonal solution. With $\mathbf{H}$ diagonal and using the fact of asymptotic diagonalization, (6) can be rewritten as

$$\bar{\varepsilon}_{d,k}^2 = (h_k - \lambda_k)^2 P_d(k). \tag{13}$$

Let the constraints in (8) be satisfied with equality; then by substituting (13) into (8), we get

$$h_k = \lambda_k + \sqrt{\frac{\alpha_k}{P_d(k)}}. \tag{14}$$

Substituting (12) into (14) and using the condition $\mu_k \ge 0$, we have

$$\mu_k = \mathrm{SNR}_k \left[ (1 - \lambda_k)\sqrt{\frac{P_d(k)}{\alpha_k}} - 1 \right] \tag{15}$$

where $\mathrm{SNR}_k = P_x(k)/P_d(k)$ is the signal-to-noise ratio (SNR) for the $k$th spectral component. Substituting (15) into (12), we obtain the final solution that satisfies both the Lagrangian gradient equation (9) and the conditions (10):

$$h_k = \lambda_k + \sqrt{\frac{\alpha_k}{P_d(k)}}. \tag{16}$$

We can reduce the number of variables in (16) by setting the threshold $\alpha_k$ to be a proportion of the noise power spectrum $P_d(k)$. Let $\alpha_k = \beta_k P_d(k)$, where $\beta_k$ is the proportionality factor and specifies the amount of attenuation of the noise power. Then (16) can be rewritten as

$$h_k = \lambda_k + \sqrt{\beta_k}. \tag{17}$$

Obviously, we now have the flexibility of balancing between the two design parameters in (17). The term $\lambda_k$ is introduced because our enhancement target is the noise-admitting spectrum $\mathbf{X}_{\Lambda}$ in (3). The value of $\lambda_k$ should be small in order to maintain a low level of residual noise in the enhanced speech. On the other hand, the parameter $\sqrt{\beta_k}$ dominates the value of the suppression gain. It can be interpreted as a conventional noise suppression function. In fact, if we let $\lambda_k = 0$ for all $k$, then the second term $\mathbf{\Lambda}\mathbf{D}$ on the right-hand side of (3) becomes zero. This means the enhanced speech spectrum $\hat{\mathbf{X}}$ will approach the clean speech spectrum $\mathbf{X}$. Several choices for the design of $\sqrt{\beta_k}$ have been proposed in [21]. Actually, if we let $\lambda_k = 0$ and $\sqrt{\beta_k} = \mathrm{SNR}_k/(\mathrm{SNR}_k + 1)$, then (17) reduces to the classical Wiener filter.

Therefore, (17) can be viewed as a combination of a gain factor and a small positive noise-flooring parameter. Clearly, the value of the gain $\sqrt{\beta_k}$ should lie within the range $0 < \sqrt{\beta_k} \le 1 - \lambda_k$ for a nontrivial design. The offset $\lambda_k$ controls the level of admissible residual noise in the enhancement output, while the quantity $\lambda_k + \sqrt{\beta_k}$ determines the final suppression level of the noise. It is noteworthy that a similar design was proposed in [24]. Unlike the FDNFP $\lambda_k$ in our method, however, the corresponding parameter in [24] was a scalar variable and not frequency-dependent. Furthermore, the gain factor in [24] was derived from an estimate of the masking thresholds. In the next section, we will show how the frequency-dependent parameters $\lambda_k$ and $\sqrt{\beta_k}$ can be utilized to enhance the harmonics of voiced speech.
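To make the interplay of the two parameters concrete, the sketch below applies the gain of (17) to a single noisy STFT frame. It is a minimal illustration, not the authors' implementation: the flat FDNFP, the noise PSD estimate, and the Wiener-style choice of the gain factor are placeholder assumptions.

```python
import numpy as np

def suppression_gain(lam, beta):
    """Spectral gain of (17): h_k = lambda_k + sqrt(beta_k), capped at 1
    consistent with the nontrivial-design bound sqrt(beta_k) <= 1 - lambda_k."""
    return np.minimum(lam + np.sqrt(beta), 1.0)

# Toy usage on one 256-sample frame of synthetic noisy speech.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
Y = np.fft.rfft(frame)                       # noisy spectrum Y(k)
Pd = np.full(Y.shape, 0.5)                   # assumed noise PSD estimate
Px = np.maximum(np.abs(Y) ** 2 - Pd, 1e-6)   # crude clean-speech PSD
snr = Px / Pd                                # per-bin SNR_k
lam = np.full(Y.shape, 0.01)                 # flat FDNFP (placeholder)
beta = (snr / (1.0 + snr)) ** 2              # sqrt(beta_k) = Wiener gain
x_hat = np.fft.irfft(suppression_gain(lam, beta) * Y, n=256)
```

With `lam` set to zero this collapses to the classical Wiener filter, matching the special case noted above.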

III. HARMONIC ENHANCEMENT


A. Harmonic Enhancement by Noise Flooring
Because voiced speech is quasi-periodic in nature, its magnitude spectrum exhibits peaks at the harmonics of the fundamental frequency, separated by valleys. The harmonic structure of clean voiced speech is often corrupted by the additive noise spectrum. Many classical noise reduction methods that use multiplicative spectral gains, e.g., short-time spectral amplitude estimators, fail to recover the harmonic structure because they do not take advantage of this redundancy in voiced speech.
Now let us examine the second term $\mathbf{\Lambda}\mathbf{D}$ on the right-hand side of (3). If we retrieved the harmonics of the clean voiced speech with some success, then we could incorporate this harmonic structure into the FDNFP $\lambda_k$. As a consequence, the spectral envelope of the noise is shaped by the harmonics before being suppressed towards the noise floor. This way, the residual noise in the enhanced speech will be shaped to have the same harmonic structure as the clean speech, and the harmonics of the clean voiced speech can be recovered. Therefore, aside from suppressing the noise to a comfortable low level, the FDNFP $\lambda_k$ in (17) can be used to enforce a harmonic shaping of the residual noise spectrum in the enhanced speech.
In order to impose a harmonic envelope upon the FDNFP, we propose here an approach that extracts the harmonic structure of voiced speech in the time domain. The motivation for time-domain processing is to preserve the correlation between both spectral amplitudes and phases when restoring the harmonics. Because the phase coherence in voiced speech is a significant source of correlation and corresponds to energy localization in the time domain [11], we retrieve the harmonic information from noisy speech by enhancing the excitation peaks in the linear prediction residuals. Fig. 1 depicts the steps for computing the FDNFP $\lambda_k$.

For voiced speech, a linear prediction (LP) analysis is performed on the noisy speech. In our implementation, the classical autocorrelation method is used to derive the LP parameters, and the model order is set to 15. The LP residual signal is processed in parallel by two different methods to enhance the excitation peaks. The first method attenuates the signal amplitudes

Fig. 1. Computation of the frequency-dependent noise flooring parameters.

between excitation peaks by windowing the LP residual signal with a series of Kaiser windows. The duration of each window is set equal to the pitch period, and the centers (peaks) of the windows are aligned in time with the peaks of the excitation pulses. The purpose of the windowing is to enhance the amplitude contrast between the peaks and valleys of the excitation pulses.

In the second method, the LP residuals are averaged over the pitch epochs:

$$\bar{r}(n) = \frac{1}{M} \sum_{i=0}^{M-1} r(n + iT), \qquad n = 0, \ldots, T - 1 \tag{18}$$
where $\bar{r}(n)$ and $r(n)$ are the averaged and noisy LP residuals, respectively, $M$ is the largest integer number of pitch periods in the current analysis frame, $T$ is the number of samples in one pitch period, $n$ is the time sample index, and $i$ is the pitch epoch index. From (18), it should be noted that the duration of $\bar{r}(n)$, the averaged LP residual, is exactly one pitch period.
$\bar{r}(n)$ is then repeated over the whole analysis frame. The motivation for this averaging is based on the fact that while the LP bursts of voiced speech are quasi-periodic, the additive noise tends to be random and uncorrelated. By averaging the LP residuals over several pitch periods, the periodic components are enhanced while the uncorrelated random components are suppressed. In order to provide the necessary pitch information for the aforementioned windowing and averaging processes, a pitch detection algorithm is run in parallel to determine the pitch period of the current frame. Here we use the relatively simple SIFT (simple inverse filter tracking) method [27] for pitch determination. Although the performance of the optimal temporal similarity method in [28] is better, it is more complicated to implement and hence increases the computational load, so the enhanced SIFT method is chosen instead.
The final processed LP residual with enhanced periodicity is obtained by

$$r_e(n) = \gamma\, r_w(n) + (1 - \gamma)\, \bar{r}_p(n) \tag{19}$$

where $\gamma$ is a smoothing factor, $r_w(n)$ is the window-enhanced LP residual, and $\bar{r}_p(n)$ is obtained by periodically extending $\bar{r}(n)$ in (18) over the entire duration of the analysis frame; $r_e(n)$ is the final LP residual with enhanced periodicity. Because the averaging-enhanced residuals may not be as accurate as the windowing-enhanced residuals, due to shimmer for example, the parameter $\gamma$ is set to 0.8. $r_e(n)$ is then transformed to the frequency domain, and its magnitude spectrum is normalized to 0 dB by its maximum magnitude. Finally, the FDNFP $\lambda_k$ is obtained by scaling the normalized spectrum of $r_e(n)$ to some comfort noise level. In our implementation, the normalized spectrum of $r_e(n)$ is scaled down by 5 dB for strongly voiced speech. This figure was experimentally optimized so that the level of residual noise permitted in the enhanced signal is kept low. Figs. 2–4 demonstrate the process of obtaining the FDNFP.

Fig. 2. Waveform of clean speech and its LP residual: (a) clean speech; (b) LP residual of clean speech.
Figs. 2 and 3 illustrate the harmonic enhancement of the LP residuals. Fig. 2 shows a frame of clean speech and its corresponding LP residual. Fig. 3(a) depicts the noisy speech obtained by degrading the clean speech with white Gaussian noise at an SNR of 5 dB; the noisy LP residual is shown in Fig. 3(b). The final periodicity-enhanced residual $r_e(n)$ of (19) is plotted in Fig. 3(c), where the noise suppression is clearly evident. In Fig. 4, the magnitude spectra of the clean, noisy, and periodicity-enhanced LP residuals are plotted in Fig. 4(a)–(c), respectively, and the FDNFP is shown in Fig. 4(d).

Fig. 3. Harmonic enhancement in linear prediction residuals: (a) speech of Fig. 2(a) degraded by white Gaussian noise (SNR = 5 dB); (b) LP residual of noisy speech; (c) LP residual with enhanced periodicity.
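The following sketch traces the residual-domain processing just described: Kaiser windowing around the excitation peaks, pitch-epoch averaging as in (18), the blend of (19), and the final scaling into an FDNFP. It assumes the pitch period and the excitation peak positions are already available from the pitch detector; the window length, Kaiser shape parameter, and edge handling are simplifications, not the authors' exact settings.

```python
import numpy as np
from scipy.signal.windows import kaiser

def enhance_residual(r, T, peaks, gamma=0.8, kaiser_beta=6.0):
    """Periodicity enhancement of one voiced LP-residual frame, (18)-(19).

    r     -- noisy LP residual (1-D array), length N >= T
    T     -- pitch period in samples (from SIFT or similar)
    peaks -- sample indices of the excitation peaks in r
    """
    N = len(r)
    # Method 1: pitch-period-long Kaiser windows centered on the
    # excitation peaks attenuate the residual between the peaks.
    w = np.zeros(N)
    win = kaiser(T, kaiser_beta)
    for p in peaks:
        lo = max(0, p - T // 2)
        hi = min(N, p - T // 2 + T)
        w[lo:hi] = np.maximum(w[lo:hi], win[lo - (p - T // 2):hi - (p - T // 2)])
    r_w = w * r

    # Method 2: average over M whole pitch periods (18), then tile the
    # one-period average over the frame (periodic extension).
    M = N // T
    r_bar = r[:M * T].reshape(M, T).mean(axis=0)
    r_p = np.tile(r_bar, N // T + 1)[:N]

    return gamma * r_w + (1.0 - gamma) * r_p       # (19)

def fdnfp_from_residual(r_e, nfft=512, scale_db=-5.0):
    """FDNFP: magnitude spectrum of r_e normalized to 0 dB, then scaled
    down (the text quotes 5 dB for strongly voiced speech)."""
    mag = np.abs(np.fft.rfft(r_e, nfft))
    return (mag / mag.max()) * 10.0 ** (scale_db / 20.0)
```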
B. Adaptive Comb Filtering
From (17) we can see that it is also beneficial to incorporate harmonic structure into $\sqrt{\beta_k}$ for voiced speech. This is because $\sqrt{\beta_k}$ is the dominant term of the suppression gain, while the values of $\lambda_k$ are usually small; this way, the level of residual noise can be suppressed more effectively. Further improvement of the perceptual quality can be achieved by imposing a harmonic envelope on $\sqrt{\beta_k}$ for voiced speech. However, unlike the $\lambda_k$, which are relatively flat over the entire frequency range, the $\sqrt{\beta_k}$ should closely follow the spectral tilt as well as the formant peaks and valleys of the speech. Therefore, we implement $\sqrt{\beta_k}$ as an adaptive comb filter by utilizing the spectral peak-picking algorithm proposed in [29].
Fig. 4. Spectra of linear prediction residuals and the FDNFP: (a) spectrum of the clean LP residual; (b) spectrum of the noisy LP residual (SNR = 5 dB); (c) spectrum of the periodicity-enhanced LP residual; (d) FDNFP.

The peak-picking method in [29] was proposed as part of a concatenative speech synthesis algorithm that uses the harmonic plus noise model (HNM). Here it is used as a means to determine the frequency locations of the comb peaks. Because the spectral peaks in [29] were picked from the spectrum of clean speech, some modifications and postprocessing steps to the peak-picking method are introduced in this paper for more reliable performance on the spectrum of noisy speech. Specifically, the harmonic test is modified as follows. The peak under test at frequency $f_i$ must first be sufficiently prominent, either in its cumulative amplitude relative to the mean cumulative amplitude (20), or in its amplitude relative to those of the neighboring peaks by a fixed margin in dB (21). Then, if $f_i$ lies close enough to its nearest harmonic $L\hat{f}_0$ (22) and the tonality test (23) is satisfied, frequency $f_i$ is declared voiced; otherwise $f_i$ is declared unvoiced. The notations in (20)–(22) are the same as defined in [29]: $f_i$ denotes the frequency location of the peak under test within the current search range, and $f_j$ are the frequencies of the other peaks within the same range; $\hat{f}_0$ is the initial fundamental frequency estimate obtained with an enhanced SIFT method [28]; $Am(f_i)$ and $Am(f_j)$ are the amplitudes at $f_i$ and $f_j$, respectively; $Am_c(f_i)$ and $Am_c(f_j)$ denote the cumulative amplitudes at $f_i$ and $f_j$, where the cumulative amplitude is defined as the non-normalized sum of the amplitudes of all samples from the previous valley to the following valley of the peak; $\overline{Am}_c$ denotes the mean value of the cumulative amplitudes $Am_c(f_j)$; and $L$ is the index of the nearest harmonic to $f_i$. Having classified frequency $f_i$ as voiced or unvoiced, the next interval is searched for its largest peak and the same harmonic test is applied. The process continues throughout the speech bandwidth. The measurements of (20)–(22) were originally introduced in [29].
In this paper, we have added the tonality measure in (23) to the harmonic test. The advantage of the tonality test is that it effectively removes spurious peaks caused by white noise. The quantity SFM in (23) denotes the spectral flatness measure as defined in [30]:

$$\mathrm{SFM} = 10 \log_{10} \frac{G_m}{A_m} \tag{24}$$

where $G_m$ and $A_m$ denote the geometric mean and the arithmetic mean of the power spectrum in the range under test, respectively. We used $\mathrm{SFM}_{\max} = -50$ dB in our implementation; in other words, an SFM of $-50$ dB indicates that the signal is entirely tone-like.

Fig. 5. Interpolation of a single harmonic peak and rejection of spurious peaks. Magnitude spectra of clean speech (solid line) and noisy speech (dotted line); peaks picked by the modified peak-picking method (crosses); peaks after postprocessing (circles). White Gaussian noise, input SNR = 5 dB.
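A small sketch of the spectral flatness measure of (24) and a tonality decision in the spirit of (23) follows; the mapping of the SFM onto a [0, 1] tonality coefficient follows [30], but the 0.5 decision threshold here is an illustrative assumption, not the paper's value.

```python
import numpy as np

def spectral_flatness_db(power_spectrum):
    """SFM of (24): 10*log10(geometric mean / arithmetic mean)."""
    p = np.asarray(power_spectrum, dtype=float) + 1e-12  # avoid log(0)
    gm = np.exp(np.mean(np.log(p)))
    return 10.0 * np.log10(gm / np.mean(p))

def is_tonal(power_spectrum, sfm_max_db=-50.0, threshold=0.5):
    """Map the SFM onto [0, 1] as in [30] (1 = entirely tone-like at
    sfm_max_db, 0 = perfectly flat) and threshold it."""
    alpha = min(spectral_flatness_db(power_spectrum) / sfm_max_db, 1.0)
    return alpha >= threshold
```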
Even though the peak-picking method is modified as above, some real harmonic peaks are rejected and some spurious peaks are accepted because of the distorting effects of the additive noise. Moreover, the harmonics of clean speech in the spectral valleys are often submerged by the noise spectrum, and consequently these harmonic peaks cannot be picked by the peak tracking method. To overcome these problems, the following postprocessing steps are performed on the peaks picked by the modified algorithm.
1) Interpolation of a single harmonic peak. A local peak is declared a harmonic peak if both of the following conditions are true:
- its frequency is within 15% of $L\hat{f}_0$, the nearest harmonic frequency;
- there are at least three peaks before and two peaks after it.
2) Rejection of isolated peaks. A harmonic peak is rejected if its distance to the nearest neighboring peaks falls outside an acceptance interval around one pitch period.
3) Recovery of multiple submerged intermediate peaks. Let $m$ and $n$ be positive integers with $m < n$. Multiple harmonic peaks are interpolated based on the following tests:
- there are no peaks picked in the frequency range $(m\hat{f}_0, n\hat{f}_0)$;
- there are at least three good harmonic peaks below $m\hat{f}_0$ and at least another three harmonics above $n\hat{f}_0$.
If both of the above conditions are true, then harmonics are interpolated in the range $(m\hat{f}_0, n\hat{f}_0)$. Assume the last harmonic below $m\hat{f}_0$ is located at $f_l$; then the interpolated harmonics are placed at frequencies $f_l + \hat{f}_0, f_l + 2\hat{f}_0, \ldots$, up to the first harmonic above $n\hat{f}_0$.

Fig. 6. Recovery of multiple peaks submerged by noise. Magnitude spectra of clean speech (solid line) and noisy speech (dotted line); peaks picked by the modified peak-picking method (crosses); peaks after postprocessing (circles). White Gaussian noise, input SNR = 5 dB.
Fig. 5 illustrates steps 1 and 2 of the postprocessing. In Fig. 5, the spectra of the clean and noisy speech are depicted as solid and dotted lines, respectively. The modified peak-picking method is applied to the spectrum of the noisy speech; the peaks it picks are marked by crosses, and the final peaks after postprocessing are marked by circles. As shown in Fig. 5, the harmonic peak near 900 Hz is interpolated by step 1, and the spurious peaks above 1600 Hz are rejected according to step 2.

Fig. 6 depicts step 3 of the postprocessing. As can be seen from Fig. 6, the spectrum of the noisy speech is relatively flat in the range 800–1700 Hz because of the additive white Gaussian noise, and the harmonics of the clean speech are submerged by the noise spectrum. Since the conditions of step 3 are satisfied, four peaks are interpolated in the range 800–1700 Hz. It should be noted that the spurious peak near 800 Hz is already eliminated by step 2 before the step 3 interpolation.
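The two postprocessing rules exercised in Figs. 5 and 6 can be sketched as follows. The three-peaks-per-side support requirement mirrors the description above, but the acceptance band edges (0.7 and 1.3 pitch periods here) are illustrative assumptions, since the paper's exact bounds did not survive extraction.

```python
def reject_isolated(peaks, f0, lo=0.7, hi=1.3):
    """Step 2: drop a peak unless its gap to at least one neighboring
    peak lies inside an acceptance band around one pitch period f0."""
    peaks = sorted(peaks)
    kept = []
    for i, f in enumerate(peaks):
        gaps = []
        if i > 0:
            gaps.append(f - peaks[i - 1])
        if i < len(peaks) - 1:
            gaps.append(peaks[i + 1] - f)
        if any(lo * f0 <= g <= hi * f0 for g in gaps):
            kept.append(f)
    return kept

def recover_submerged(peaks, f0, gap_lo, gap_hi, support=3):
    """Step 3: if no peaks were picked inside (gap_lo, gap_hi) but at
    least `support` harmonics flank the gap on each side, insert
    harmonics at multiples of f0 starting from the last peak below."""
    below = [f for f in peaks if f <= gap_lo]
    above = [f for f in peaks if f >= gap_hi]
    if len(below) < support or len(above) < support:
        return sorted(peaks)
    new, f = [], below[-1] + f0
    while f < above[0] - 0.5 * f0:
        new.append(f)
        f += f0
    return sorted(peaks + new)
```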


After finding as many additional frequency locations of harmonic peaks as possible, we are ready to design the gain factor $\sqrt{\beta_k}$ in (17) as an adaptive comb filter. In the first step, an initial comb filter is implemented in the frequency domain as

$$H_c(f) = \begin{cases} G_p(f_p)\, W(f - f_p), & |f - f_p| \le f_0/2 \\ G_s(f), & \text{otherwise} \end{cases} \tag{25}$$

where $f_p$ is a peak frequency as determined by the modified peak-picking method and its postprocessing, and $H_c(f)$ is the frequency response of the initial comb filter at frequency $f$. $W(\cdot)$ is a lobe-shaping window whose width is controlled by a parameter $\rho$ [10], set to 2 in our implementation. The quantity $G_p(f_p)$ specifies the filter gain at the peak frequency $f_p$. Notice that in (25) the comb structures are implemented only within the vicinity of one fundamental frequency (pitch $f_0$) range centered at the peak frequency $f_p$; the value of $G_s(f)$ determines the filter response outside the frequency range $[f_p - f_0/2,\ f_p + f_0/2]$. Since there are many design choices for the gain factor $\sqrt{\beta_k}$, the designs of $G_p$ and $G_s$ are also flexible. In this paper, we implemented $G_p$ and $G_s$ as Wiener-type gains

$$G_p(f_p) = \frac{\hat{P}_x(f_p)}{\hat{P}_x(f_p) + \hat{P}_d(f_p)} \tag{26}$$

and

$$G_s(f) = \frac{\hat{P}_x(f)}{P_y(f)} \tag{27}$$

where $\hat{P}_x$ is the estimated power spectrum of the clean speech and $P_y$ is the power spectrum of the noisy speech, which can be computed directly from the noisy speech. Accurate estimation of the clean speech spectrum is crucial to the performance of the proposed harmonic enhancement method. We have used the classical spectral subtraction

$$\hat{P}_x(k) = \max\{P_y(k) - \hat{P}_d(k),\ \delta\} \tag{28}$$

where $\delta$ is a zero-flooring parameter and $\hat{P}_d(k)$ is the estimated noise spectrum. $k$ is simply the index of frequency $f$; in the following text, $f$ and $k$ are used interchangeably. To obtain the estimated noise spectrum $\hat{P}_d(k)$, the minimum statistics tracking method in [31] is implemented.
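As a minimal illustration of (28), assuming the noise PSD estimate comes from the minimum statistics tracker of [31] (not reimplemented here), a per-bin spectral subtraction with a zero-flooring parameter might look like:

```python
import numpy as np

def spectral_subtraction(Py, Pd_hat, delta=1e-3):
    """Clean-speech PSD estimate of (28): subtract the estimated noise
    PSD from the noisy PSD, flooring the result at delta (the flooring
    value here is an assumption, not the paper's setting)."""
    return np.maximum(Py - Pd_hat, delta)
```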
Eventually, the gain factor $\sqrt{\beta_k}$ in (17) is obtained by

$$\sqrt{\beta_k} = \max\{H_c(k),\ 0.01\} \tag{29}$$

where the floor of 0.01 corresponds to $-20$ dB in power. Since the initial comb filter $H_c(k)$ has variable peak magnitudes and peak frequencies, and the peak magnitudes are normally larger than this floor, the gain factor in (29) can be referred to as an adaptive comb filter. The motivation for choosing $G_p$ as in (26) is that the spectral gain in (17) then reduces to the Wiener filter at the peak frequency $f_p$ when the estimates of the noise and clean speech spectra are perfectly accurate, i.e., $\hat{P}_x = P_x$ and $\hat{P}_d = P_d$, with $P_y = P_x + P_d$. The advantage of using $G_s$ as in (27) is that lost harmonic peaks can be retrieved through accurate estimation of the clean speech spectrum, as will be shown in the following example of Fig. 7.

Fig. 8. Spectrum of the harmonic-enhanced speech; the corresponding spectra of the clean and noisy speech and the picked peaks are shown in Fig. 6.
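Putting (25)-(29) together, a per-bin construction of the adaptive comb gain could run as below. The raised-cosine lobe stands in for the width-controlled window W(.), since the paper states only that rho controls the comb width; everything else follows the equations above.

```python
import numpy as np

def adaptive_comb_gain(freqs, peaks, f0, Px_hat, Pd_hat, Py,
                       rho=2, floor=0.01):
    """Adaptive comb filter of (25)-(29) evaluated on a frequency grid.

    freqs -- bin center frequencies (Hz), 1-D array
    peaks -- picked harmonic peak frequencies (Hz)
    f0    -- fundamental frequency estimate (Hz)
    """
    H = Px_hat / np.maximum(Py, 1e-12)            # (27): G_s between peaks
    for fp in peaks:
        in_lobe = np.abs(freqs - fp) <= 0.5 * f0  # one-pitch-wide lobe
        j = np.argmin(np.abs(freqs - fp))
        Gp = Px_hat[j] / (Px_hat[j] + Pd_hat[j])  # (26): Wiener gain at fp
        lobe = np.cos(np.pi * (freqs[in_lobe] - fp) / f0) ** rho
        H[in_lobe] = np.maximum(H[in_lobe], Gp * lobe)
    return np.maximum(H, floor)                   # (29): floor at 0.01
```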
For the noisy spectrum shown in Fig. 6, the corresponding dominant gain factor is plotted in Fig. 7. Clearly, the gain factor in Fig. 7 is an adaptive comb filter with variable peak magnitudes and peak frequencies. The advantage of using $G_s$ in (27) is also apparent: the harmonic peak near 2400 Hz was missed by the modified peak-picking method and postprocessing, but it is retrieved by the adaptive comb filter. The spectrum of the harmonic-enhanced speech is shown in Fig. 8. By comparing the spectrum of the noisy speech in Fig. 6 with the enhanced spectrum in Fig. 8, we can see that the four harmonic peaks in 800–1700 Hz are enhanced. The corresponding waveforms of the clean, noisy, and harmonic-enhanced speech are shown in Fig. 9.
Finally, the proposed harmonic enhancement method requires a voice activity detector (VAD) to classify the speech signal and a pitch determination algorithm (PDA) to find the pitch of voiced speech. Accurate pitch determination and robust voice activity detection under noisy environments are well-studied topics in the research literature, and they are crucial components for the success of the proposed harmonic enhancement algorithm. In our implementation, the enhanced SIFT method in [28] was used for pitch determination. The VAD is based on the short-time energy level, zero-crossing rate, loudness, and the success of pitch detection. The short-time speech frames are classified as voiced, unvoiced, or silence, and harmonic enhancement is applied only to voiced frames. The Wiener-type gain $G_s$ in (27) is used for unvoiced and silent frames, with the FDNFP set to constant values of 0.01 for unvoiced frames and 0.001 for silent frames.

Fig. 9. Waveforms of clean, noisy, and harmonic-enhanced speech, with the peak-picking of Fig. 6: (a) clean speech; (b) noisy speech (white Gaussian noise, SNR = 5 dB); (c) harmonic-enhanced speech.
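The per-frame control logic described in this section reduces to a small dispatch. In this sketch the frame classifier and the two gain computations are assumed to exist elsewhere; the 0.01 and 0.001 floors are the values quoted above.

```python
def frame_parameters(frame_type, comb_gain, wiener_gain, harmonic_fdnfp):
    """Select the gain factor sqrt(beta_k) and the FDNFP lambda_k for one
    frame. Voiced frames get the adaptive comb gain and the
    harmonic-shaped FDNFP; unvoiced and silent frames fall back to the
    Wiener-type gain of (27) with constant noise floors."""
    if frame_type == "voiced":
        return comb_gain, harmonic_fdnfp
    if frame_type == "unvoiced":
        return wiener_gain, 0.01
    return wiener_gain, 0.001          # silence
```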

Fig. 10. Waveform and spectrogram of clean and noisy speech (female speech, multitalker babble noise, SNR = 5 dB): (a) clean speech; (b) spectrogram of clean speech; (c) noisy speech; (d) spectrogram of noisy speech.

IV. PERFORMANCE EVALUATION


The harmonic enhancement method was tested on 60 sentences (30 male, 30 female) with durations between 4 and 6 s, taken from the TIMIT speech database. The sentences were downsampled to 8 kHz before the noise samples were added. The noise sources were downloaded from the IEEE Signal Processing Information Base [32]. Two types of noise were used, namely white Gaussian noise and multitalker babble noise. The noise power level was scaled and the noise added to the downsampled clean speech to generate noisy speech with SNRs in the range of 0 to 20 dB in 5-dB steps. The enhancement was applied to 32-ms (256-sample) frames of noisy speech with a 50% overlap between adjacent frames, resulting in a frame shift of 16 ms. The FFT size was 512, and the enhancement output was obtained by the overlap-and-add method.
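The analysis-synthesis framing just described might be wired up as follows; the Hann analysis window is an assumption (the paper does not name its window), while the frame length, overlap, and FFT size match the text.

```python
import numpy as np

def enhance_signal(y, gain_fn, frame_len=256, nfft=512):
    """32-ms frames at 8 kHz, 50% overlap (16-ms shift), FFT size 512,
    overlap-and-add resynthesis. gain_fn maps a noisy spectrum to a
    per-bin gain, e.g. the suppression gain of (17)."""
    hop = frame_len // 2
    win = np.hanning(frame_len)          # assumed analysis window
    out = np.zeros(len(y) + frame_len)
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len] * win
        Y = np.fft.rfft(frame, nfft)
        X = gain_fn(Y) * Y               # apply the spectral gain
        x = np.fft.irfft(X, nfft)[:frame_len]
        out[start:start + frame_len] += x
    return out[:len(y)]
```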
For comparison, we implemented and evaluated the spectral weighting gain method of [24], referred to here as the just noticeable distortion (JND) method, which is given by

$$G_{\mathrm{JND}}(k) = \max\left\{ \sqrt{\frac{T(k)}{\hat{P}_d(k)}},\ \nu \right\} \tag{30}$$

where $T(k)$, $\hat{P}_d(k)$, and $G_{\mathrm{JND}}(k)$ are the masking threshold, noise spectrum, and JND weighting gain for the $k$th spectral component, respectively, and $\nu$ is a noise-flooring parameter. In our implementation, $T(k)$ is estimated by the MPEG-4 psychoacoustic model [33], $\hat{P}_d(k)$ is obtained by the method in [31], and $\nu$ is fixed at a constant dB value. Notice that the noise-flooring parameter in [24] is a constant for all frequencies.
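For contrast with the FDNFP-based gain, here is a sketch of this JND-style weighting; the unity cap is a common safeguard that the text does not state explicitly, and the floor value is illustrative.

```python
import numpy as np

def jnd_gain(T, Pd_hat, nu=0.1):
    """JND-style weighting in the spirit of (30) [24]: let noise pass up
    to the masking threshold T(k), never amplify, and keep a constant,
    frequency-independent floor nu (illustrative value)."""
    g = np.sqrt(T / np.maximum(Pd_hat, 1e-12))
    return np.maximum(np.minimum(g, 1.0), nu)
```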
For a more complete comparison, we also implemented the classical spectral subtraction enhancement method [13], the Ephraim and Van Trees signal subspace speech enhancement method [21], which was used in the white noise conditions, and the Lev-Ari and Ephraim subspace method for colored noise [23].
The enhancement algorithms were evaluated by the widely used ITU-T PESQ (Perceptual Evaluation of Speech Quality) scores [34] as well as the modified bark spectral distortion (MBSD) measure [35]. ITU-T P.862 (PESQ) converts the disturbance parameters in speech to a MOS-like listening quality score ranging from -0.5 to 4.5; the higher the score, the better the perceptual quality [34]. It is claimed that PESQ scores have a 0.935 average correlation with subjective scores [36]. The MBSD measure [35] is an improvement of the bark spectral distortion (BSD) objective measure [37]. Both the ITU-T PESQ scores and the MBSD measures are objective measures that are claimed to be highly correlated with
the subjective quality of speech. A comparison between ITU-T PESQ and MBSD can be found in [38].

Fig. 11. Waveform and spectrogram of JND [24] and spectral subtraction [13] enhanced speech (female speech, multitalker babble noise, SNR = 5 dB): (a) JND-enhanced speech; (b) spectrogram of JND-enhanced speech; (c) spectral-subtraction-enhanced speech; (d) spectrogram of spectral-subtraction-enhanced speech.

Fig. 12. Waveform and spectrogram of subspace [23] and proposed harmonic-enhanced speech (female speech, multitalker babble noise, SNR = 5 dB): (a) subspace-enhanced speech; (b) spectrogram of subspace-enhanced speech; (c) harmonic-enhanced speech; (d) spectrogram of harmonic-enhanced speech.
We also used the new composite objective measures developed in [39], which are obtained by linearly combining existing objective measures to form new measures. Such measures aim to predict the quality of noisy speech enhanced by noise suppression algorithms. Three composite measures were used, as sketched after this list:
1) C_SIG: a composite measure for signal distortion (SIG), formed by linearly combining the log-likelihood ratio (LLR), PESQ, and weighted-slope spectral distance (WSS) measures.
2) C_BAK: a composite measure for background noise distortion (BAK), formed by linearly combining the segmental SNR (segSNR), PESQ, and WSS measures.
3) C_OVL: a composite measure for overall quality (OVL), formed by linearly combining the PESQ, LLR, and WSS measures.
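Each composite measure is an affine combination of the basic measures; the sketch below keeps the regression weights as inputs rather than hard-coding them, since the fitted values belong to [39].

```python
def composite_measures(llr, pesq, wss, seg_snr, w_sig, w_bak, w_ovl):
    """Composite measures of [39]: intercept-plus-weights combinations of
    the basic objective measures. Each w_* is (intercept, w1, w2, w3)."""
    c_sig = w_sig[0] + w_sig[1] * llr + w_sig[2] * pesq + w_sig[3] * wss
    c_bak = w_bak[0] + w_bak[1] * seg_snr + w_bak[2] * pesq + w_bak[3] * wss
    c_ovl = w_ovl[0] + w_ovl[1] * pesq + w_ovl[2] * llr + w_ovl[3] * wss
    return c_sig, c_bak, c_ovl
```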

Figs. 10–12 show an example of the enhancement of noisy speech degraded by multitalker babble noise. The waveforms and spectrograms of the clean and noisy speech are depicted in Fig. 10. The clean speech is the sentence "Lori's costume needed black gloves to be completely elegant." spoken by a female speaker, and the noisy speech is obtained by degrading it with multitalker babble noise at an SNR of 5 dB. The waveforms and spectrograms of the JND-enhanced and spectral-subtraction-enhanced speech are shown in Fig. 11, while the subspace-enhanced [23] speech and the proposed harmonic-enhanced (HE) speech are shown in Fig. 12. It is evident that the HE speech preserves more harmonics and suppresses more high-frequency noise than the other three methods. Furthermore, the proposed method suppresses babble-noise-related harmonics more effectively than the competing methods (e.g., around the time of 3 s).
Comprehensive test results are presented in Figs. 13 and 14. The average PESQ scores and MBSD measures are used as objective metrics of enhancement performance. Fig. 13 plots the average enhancement results for the 60 sentences degraded by white Gaussian noise; the average enhancement results for the same 60 sentences degraded by multitalker babble noise are shown in Fig. 14. The input SNRs for both noise conditions are set at 0, 5, 10, 15, and 20 dB.




Fig. 13. Average PESQ scores and MBSD measures of 60 sentences of JND enhanced speech (dotted line), subspace enhanced speech (solid line), spectral
subtraction enhanced speech (dash-dot line), and harmonic enhanced speech (dashed line). The noise is white Gaussian at SNR of 0, 5, 10, 15, and 20 dB.

Fig. 14. Average PESQ scores and MBSD measures of 60 sentences of JND enhanced speech (dotted line), spectral subtraction enhanced speech (dash-dotted
line), subspace enhanced speech (solid line), and harmonic enhanced speech (dashed line). The noise is multitalker babble noise at SNR of 0, 5, 10, 15, and 20 dB.

In both Figs. 13 and 14, the objective measures (PESQ and MBSD) of the JND-enhanced speech are marked by dotted lines and diamonds, the measurements of spectral subtraction by dash-dotted lines and circles, and the measurements of the subspace enhancements for white and colored noise by solid lines and plus signs. The performance of the proposed HE method is plotted with dashed lines and asterisks.
From Figs. 13 and 14 we can see that the proposed harmonic enhancement method outperforms the JND approach [24] and the subspace approaches [21], [23] at all SNR conditions for both the white Gaussian noise and babble noise cases. The performance improvement of the proposed HE method is more pronounced at low SNR. In the case of the spectral subtraction method, the PESQ scores suggest a performance similar to that of the proposed harmonic enhancement method in the white noise case for SNR greater than 5 dB, and at an SNR of 20 dB for babble noise. However, one should be aware that PESQ is largely insensitive to the musical noise introduced by spectral subtraction, an observation also confirmed by our subjective listening tests. In the case of the average MBSD scores, the proposed method outperformed the other three methods by large margins for both types of noise, particularly at low SNR conditions; as the input SNR increases, the performances of all four methods tend to converge. This is because as the input SNR increases, the harmonics of clean voiced speech become less distorted in the noisy speech, so the benefits of harmonic enhancement diminish. Comparing Fig. 13 with Fig. 14, it can be observed that harmonic enhancement is particularly effective for multitalker babble noise, where voiced signals from background speakers often introduce unwanted harmonic regions that need to be suppressed. That is also evident in the spectrograms of Figs. 10–12. Statistical analysis of the PESQ and MBSD results was performed by examining Fisher's F-distribution computed via analysis of variance (ANOVA). The results showed that the proposed harmonic enhancement method improved performance with greater than 99.9% certainty.
Table I lists the results for the composite performance measures, organized according to the three evaluated measures C_SIG, C_BAK, and C_OVL. We can see that the harmonic enhancement method outperforms the other three methods under white Gaussian noise and multitalker babble noise, particularly at low SNR. When the SNR is greater than 10 dB, the subspace or spectral subtraction methods outperform the proposed method. This is because when the speech is already clear enough, trying to enhance the harmonics may actually degrade the quality by introducing artifacts.

TABLE I
COMPOSITE MEASUREMENT COMPARISONS OF 60 SENTENCES OF JND ENHANCED SPEECH, SPECTRAL SUBTRACTION ENHANCED SPEECH, SUBSPACE ENHANCED SPEECH, AND HARMONIC ENHANCED SPEECH
A subjective listening test was conducted to compare the performance of the proposed harmonic enhancement method against those of the three other techniques in A-B tests. The test was performed by a group of 16 listeners (four female, 12 male), aged between 18 and 25, all students at the University of Miami; the authors were excluded from the test, and the subjects were not familiar with the sentences used. Three sentences were selected from the TIMIT database and used to generate babble and white noise-corrupted speech at SNRs of 0 and 10 dB, which was processed with all four methods. The resulting total of 18 sentence pairs, plus six sentence pairs of original noisy speech for each SNR condition, were presented to each subject through headphones. For each SNR level, the order of the sentence pairs was randomized, and neither the structure of the test nor the identity of each enhancement method was revealed to the listeners. Each subject was asked to compare the audio clips in each pair and vote to indicate their quality preference. The test was designed to be short in order to avoid listener fatigue, although the subjects were allowed to listen to each sentence pair as many times as they needed to make a decision. On average, the subjects voted in favor of the harmonic approach for 72.66% of the audio clips, with a standard deviation of 2.35. ANOVA showed that the results of the test were statistically significant at levels greater than 99.9%. Sound examples of noisy speech and its enhancement using the four methods implemented in this paper are available at our website: http://www.chronos.ece.miami.edu/dasp/harmonic_speech_enhancement.html.
V. CONCLUSION
In this paper, we have proposed a speech enhancement
method which aims at emphasizing harmonics. The harmonics
are enhanced by processing the degraded speech in both
the time and frequency domains. In contrast to many other
state-of-the-art methods, the proposed algorithm allows a low
level of residual noise in the enhanced speech. The noisy
speech is enhanced in the frequency domain by a spectral
weighting function, which contains two design parameters.
One of the design parameters, namely, the frequency-dependent noise-flooring parameter (FDNFP), is used to emphasize
the harmonics of voiced speech as well as to control the
frequency-dependent level of admissible residual noise. For
voiced speech, the periodicity in the linear prediction residual
signal was detected and enhanced and then transformed to the
frequency domain to be used as the FDNFP. The magnitudes of
the FDNFP are scaled to some small values in order to suppress
the level of residual noise in the enhanced speech. The other
design parameter is the dominant term in the spectral gain function. For voiced frames, it enhances the harmonics by adaptive comb filtering, where the comb filter is implemented in the frequency domain by utilizing a spectral peak-picking algorithm. For unvoiced and silent frames, the dominant weighting parameter reduces to a Wiener-type gain.
The enhancement algorithm was tested on 60 sentences degraded by white Gaussian and multitalker babble noise at various input SNRs. The enhancement performance was evaluated in terms of average ITU-T PESQ scores, MBSD, and composite objective measures. Three other methods were implemented and their performance compared against that of the proposed method: spectral subtraction [13]; the Ephraim and Van Trees signal subspace speech enhancement method [21] for white noise, together with the Lev-Ari and Ephraim subspace method for colored noise [23]; and a perceptually based (JND) enhancement method employing a constant noise-flooring parameter [24]. Experimental results indicate that the proposed harmonic enhancement (HE) method outperforms the other methods, particularly at low SNR conditions. In the spectrograms of the enhanced speech, the harmonics are more prominent and the overall noise more suppressed for HE speech than for the other methods. The subjective listening test also indicated that the proposed method is generally preferred. All obtained results were statistically significant at a very high confidence level.
REFERENCES

[1] J. Hardwick, C. D. Yoo, and J. S. Lim, "Speech enhancement using the dual excitation model," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1993, pp. 367-370.
[2] S. Dubost and O. Cappe, "Enhancement of speech based on non-parametric estimation of a time varying harmonic representation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2000, pp. 1859-1862.
[3] M. E. Deisher and A. S. Spanias, "HMM-based speech enhancement using harmonic modeling," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1997, pp. 1175-1178.
[4] M. E. Deisher and A. S. Spanias, "Speech enhancement using state-based estimation and sinusoidal modeling," J. Acoust. Soc. Amer., vol. 102, no. 2, pp. 1141-1148, Aug. 1997.
[5] J. Jensen and J. H. L. Hansen, "Speech enhancement using a constrained iterative sinusoidal model," IEEE Trans. Speech Audio Process., vol. 9, no. 7, pp. 731-740, Oct. 2001.
[6] D. V. Anderson and M. A. Clements, "Audio signal noise reduction using harmonic modeling," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1999, pp. 805-808.
[7] D. Morgan, B. George, L. Lee, and S. M. Kay, "Cochannel speaker separation by harmonic enhancement and suppression," IEEE Trans. Speech Audio Process., vol. 5, no. 5, pp. 407-424, Sep. 1997.
[8] T. F. Quatieri and R. G. Danisewicz, "An approach to co-channel talker interference suppression using a sinusoidal model for speech," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 1, pp. 56-69, Jan. 1990.
[9] A. Erell and M. Weintraub, "Estimation of noise-corrupted speech DFT spectrum using the pitch period," IEEE Trans. Speech Audio Process., vol. 2, no. 1, pp. 1-8, Jan. 1994.
[10] A.-T. Yu and H.-C. Wang, "New speech harmonic structure measure and its application to post speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2004, pp. 729-732.
[11] C. Li and S. V. Andersen, "Inter-frequency dependency in MMSE speech enhancement," in Proc. 6th Nordic Signal Process. Symp., 2004, pp. 200-203.
[12] C. Plapous, C. Marro, and P. Scalart, "Speech enhancement using harmonic regeneration," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2005, pp. 157-160.
[13] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-27, no. 2, pp. 113-120, Apr. 1979.
[14] V. Grancharov, J. H. Plasberg, J. Samuelsson, and W. B. Kleijn, "Generalized postfilter for speech quality enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 57-64, Jan. 2008.
[15] J. H. Chen and A. Gersho, "Adaptive postfiltering for quality enhancement of coded speech," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 59-71, Jan. 1995.
[16] Y. Ephraim, "A minimum mean square error approach for speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1990, vol. 2, pp. 829-832.
[17] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Process., vol. 33, no. 2, pp. 443-445, Apr. 1985.
[18] K. K. Paliwal and A. Basu, "A speech enhancement method based on Kalman filtering," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1987, pp. 177-180.
[19] V. Grancharov, J. H. Plasberg, J. Samuelsson, and W. B. Kleijn, "Speech enhancement using a masking threshold constrained Kalman filter and its heuristic implementations," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 1, pp. 19-32, Jan. 2006.
[20] M. Gabrea, "Adaptive Kalman filtering-based speech enhancement algorithm," in Proc. Canadian Conf. Elect. Comput. Eng., Fredericton, NB, Canada, 2001, vol. 1, pp. 521-526.
[21] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. Speech Audio Process., vol. 3, no. 4, pp. 251-266, Jul. 1995.
[22] U. Mittal and N. Phamdo, "Signal/noise KLT based approach for enhancing speech degraded by colored noise," IEEE Trans. Speech Audio Process., vol. 8, no. 2, pp. 159-167, Mar. 2000.
[23] H. Lev-Ari and Y. Ephraim, "Extension of the signal subspace speech enhancement approach to colored noise," IEEE Signal Process. Lett., vol. 10, no. 4, pp. 104-106, Apr. 2003.
[24] S. Gustafsson, P. Jax, and P. Vary, "A novel psychoacoustically motivated audio enhancement algorithm preserving background noise characteristics," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1998, pp. 397-400.
[25] Y. Hu and P. C. Loizou, "Incorporating a psychoacoustical model in frequency domain speech enhancement," IEEE Signal Process. Lett., vol. 11, no. 2, pp. 270-273, Feb. 2004.
[26] R. Gray, "On the asymptotic eigenvalue distribution of Toeplitz matrices," IEEE Trans. Inf. Theory, vol. IT-18, no. 6, pp. 725-730, Nov. 1972.
[27] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, no. 5, pp. 367-377, Dec. 1972.
[28] P. Veprek and M. S. Scordilis, "Analysis, enhancement and evaluation of five pitch determination techniques," Speech Commun., vol. 37, pp. 249-270, Jul. 2002.
[29] Y. Stylianou, "Applying the harmonic plus noise model in concatenative speech synthesis," IEEE Trans. Speech Audio Process., vol. 9, no. 1, pp. 21-29, Jan. 2001.
[30] J. D. Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE J. Sel. Areas Commun., vol. 6, no. 2, pp. 314-323, Feb. 1988.
[31] R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech Audio Process., vol. 9, no. 5, pp. 504-512, Jul. 2001.
[32] D. H. Johnson and P. N. Shami, "The signal processing information base," IEEE Signal Process. Mag., vol. 10, no. 4, pp. 36-42, Oct. 1993. [Online]. Available: http://www.spib.rice.edu/spib/select_noise.html
[33] Information Technology - Coding of Audio-Visual Objects - Part 3: Audio, ISO/IEC 14496-3:2005, 2005.
[34] Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Networks and Speech Codecs, ITU-T Rec. P.862, Feb. 2001. [Online]. Available: http://www.itu.int/rec/T-REC-P.862-200102-I/en, accessed on Aug. 15, 2008.
[35] W. Yang, M. Benbouchta, and R. Yantorno, "Performance of the modified bark spectral distortion as an objective speech quality measure," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1998, pp. 541-544.
[36] J. G. Beerends, A. P. Hekstra, A. W. Rix, and M. P. Hollier, "Perceptual evaluation of speech quality (PESQ): The new ITU standard for end-to-end speech quality assessment, part II: Psychoacoustic model," J. Audio Eng. Soc., vol. 50, no. 10, pp. 765-778, Oct. 2002.
[37] S. Wang, A. Sekey, and A. Gersho, "An objective measure for predicting subjective quality of speech coders," IEEE J. Sel. Areas Commun., vol. 10, no. 5, pp. 819-828, Jun. 1992.
[38] W. Yang and R. Yantorno, "Comparison of two objective speech quality measures: MBSD and ITU-T recommendation P.861," in Proc. IEEE 2nd Workshop Multimedia Signal Process., Dec. 1998, pp. 426-431.
[39] Y. Hu and P. C. Loizou, "Evaluation of objective measures for speech enhancement," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 1, pp. 229-238, Jan. 2008.

Wen Jin received the M.S. and Ph.D. degrees in electrical and computer engineering from University of
Miami, Coral Gables, FL, in 2001 and 2006, respectively.
His research interests include the general area of
audio and speech processing, especially in the area of
audio and speech coding, and single-channel speech
enhancement. He is now with Qualcomm, Inc.

Xin Liu was born in Beijing, China, on April 21, 1983. She received the B.S. degree in electrical engineering from Beijing University of Chemical Technology (BUCT), Beijing, in 2005. She is currently pursuing the Ph.D. degree in the Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL.
From 2005 to 2007, she was a Software Engineer (database administrator) with Beijing Guoxin Communication System Co., Ltd. (a subsidiary of China Telecom). Her research interests are in speech enhancement and speech and audio processing.

Michael S. Scordilis (SM03) received the B.E. degree in communication engineering from the Royal
Melbourne Institute of Technology, Melbourne, Australia, in 1984, and the M.S. degree in electrical engineering and the Ph.D. degree in engineering from
Clemson University, Clemson, SC, in 1986 and 1990,
respectively.
From 1990 to 1995, he was University Lecturer at
the University of Melbourne, Melbourne, Australia.
He has held visiting Senior Researcher positions at
Bell Communications Research (Bellcore), Morristown, NJ, Sun Microsystems Labs, Chelmsford, MA, and the University of
Patras, Patras, Greece. He is now Research Associate Professor of Electrical
and Computer Engineering at the University of Miami, Coral Gables, FL. His
current research interests include signal processing for speech, audio, signal
recovery and enhancement, psychoacoustics, language processing, and multimedia signal processing. He is an active industry consultant in the areas of audio
and speech analysis, recognition and compression, and multimedia services, and
holds patents in those areas. He has published over 60 papers in major journals
and conferences.
Dr. Scordilis received the 2003 Eliahu I. Jury Award for Excellence in Research of the College of Engineering, University of Miami. He is a member of
the Technical Chamber of Greece.

Lu Han received the M.S. degree in electrical engineering from the Harbin Institute of Technology, Harbin, China, in 2007.
In August 2007, she joined the Digital Audio and Speech Processing Lab at the University of Miami, Coral Gables, FL, as a Research Assistant working on speech enhancement. In August 2008, she transferred to the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh. Her current research interests include image processing and computer vision.
