
IEEE SIGNAL PROCESSING LETTERS, VOL. 20, NO. 5, MAY 2013

Voice Activity Detection Via Noise Reducing Using Non-Negative Sparse Coding

Peng Teng and Yunde Jia

Abstract—This letter presents a voice activity detection (VAD) approach using non-negative sparse coding to improve the detection performance in low signal-to-noise ratio (SNR) conditions. The basic idea is to use features extracted from a noise-reduced representation of original audio signals. We decompose the magnitude spectrum of an audio signal on a speech dictionary learned from clean speech and a noise dictionary learned from noise samples. Only the coefficients corresponding to the speech dictionary are considered and used as the noise-reduced representation of the signal for feature extraction. A conditional random field (CRF) is used to model the correlation between feature sequences and voice activity labels along audio signals. Then, we assign the voice activity labels for a given audio signal by decoding the CRF. Experimental results demonstrate that our VAD approach performs well in low SNR conditions.

Index Terms—Conditional random fields, noise reducing, non-negative sparse coding, voice activity detection.

I. INTRODUCTION

VOICE activity detection (VAD) is used to detect the presence of speech in an audio signal. It plays an important role in numerous modern speech communication systems. In the last decade, since Sohn et al. [1] proposed a VAD algorithm with impressive performance in 1999, there have been many variants of VAD focusing on approaches using statistical models [2]–[4]. Regarding VAD as a binary classification problem, researchers have employed various feature extraction methods as well as classifiers based on statistical learning theory in their VAD approaches [5]–[7]. For example, You et al. [7] proposed a VAD algorithm based on the sparse coding technique, aiming to improve the noise-robustness of features for speech detection, and Saito et al. [5] developed a VAD system based on conditional random fields (CRF) [8] using multiple popular features. However, most of these approaches use features extracted directly from representations of the mixture of speech and noise. The capability of the features for speech/pause discrimination might be seriously degraded in low signal-to-noise ratio (SNR) conditions. To mitigate the degradation, we propose a VAD approach via noise reducing using non-negative sparse coding, in which features for speech detection are extracted from a noise-reduced representation of original audio signals.

Manuscript received January 19, 2013; revised March 04, 2013; accepted March 07, 2013. Date of publication March 14, 2013; date of current version March 28, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jeronimo Arenas-Garcia. The authors are with the Beijing Lab of Intelligent Information Technology and the School of Computer Science, Beijing Institute of Technology, Beijing, China (e-mail: tengpeng@bit.edu.cn; jiayunde@bit.edu.cn). Digital Object Identifier 10.1109/LSP.2013.2252615

II. AUDIO SIGNAL ANALYSIS USING NON-NEGATIVE SPARSE CODING

Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_T]$ be the magnitude spectrum of an audio signal with $T$ time frames, where $\mathbf{x}_t = [x_{1t}, \ldots, x_{Ft}]^{\mathsf{T}}$ denotes the magnitude of the $t$-th time frame and $f \in \{1, \ldots, F\}$ is the frequency-bin index. Each $\mathbf{x}_t$ can be approximated by a linear combination of an over-complete set of bases $\{\mathbf{d}_1, \ldots, \mathbf{d}_N\}$ with weights $h_{1t}, \ldots, h_{Nt}$, where $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_N]$ is called a dictionary and $\mathbf{h}_t = [h_{1t}, \ldots, h_{Nt}]^{\mathsf{T}}$ is a coefficient vector. Denoting $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_T]$, the magnitude spectrum of the signal can be decomposed under the non-negativity constraint $\mathbf{D}, \mathbf{H} \ge 0$ using non-negative matrix factorization (NMF) as $\mathbf{X} \approx \mathbf{D}\mathbf{H}$. NMF gives a "parts"-based representation of the signal, as only additive combinations of the bases in $\mathbf{D}$ are allowed in the approximation. NMF with a typical sparseness constraint on $\mathbf{H}$ is known as non-negative sparse coding (NSC) [9]. NSC is an attractive middle-level signal representation method for noise-robust feature extraction. It can be achieved by minimizing the distance between the signal and its approximation:

$$(\hat{\mathbf{D}}, \hat{\mathbf{H}}) = \arg\min_{\mathbf{D}, \mathbf{H} \ge 0} \|\mathbf{X} - \mathbf{D}\mathbf{H}\|_F^2 + \lambda \sum_{n,t} h_{nt} \qquad (1)$$

where $\hat{\mathbf{D}}$ and $\hat{\mathbf{H}}$ are the estimated optimal values of $\mathbf{D}$ and $\mathbf{H}$, respectively; $\|\cdot\|_F$ denotes the Frobenius norm; $h_{nt}$ is the $(n,t)$-th element of $\mathbf{H}$, and the non-negative constant $\lambda$ controls the sparsity of $\hat{\mathbf{H}}$. (1) is usually subject to a normalization constraint on the bases of $\mathbf{D}$, so that the elements of $\hat{\mathbf{H}}$ provide a power-based representation on $\hat{\mathbf{D}}$. Note that NSC is equivalent to NMF when $\lambda = 0$. The dictionary $\mathbf{D}$ can be learned according to (1) using the NSC algorithm proposed in [10]. The decomposition of $\mathbf{X}$ into $\hat{\mathbf{H}}$ on a given dictionary can be achieved by using the same algorithm but with a fixed $\mathbf{D}$. Additionally, we define $\hat{\mathbf{X}} = \hat{\mathbf{D}}\hat{\mathbf{H}}$ as the reconstruction of $\mathbf{X}$, and $\mathbf{E} = \mathbf{X} - \hat{\mathbf{X}}$ as the residual.
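The letter performs this coding step with the online algorithm of [10]. As a self-contained illustration only, the sketch below estimates the coefficient matrix for a fixed dictionary with the classic multiplicative update for sparsity-penalized NMF; the function name, the initialization, and the default values of the sparsity weight and iteration count are our own choices, not taken from the letter.

```python
import numpy as np

def nsc_encode(X, D, lam=0.1, n_iter=200, eps=1e-12):
    """Estimate non-negative, sparse coefficients H such that X ~ D @ H.

    Minimizes ||X - D H||_F^2 + lam * sum(H) over H >= 0 with the
    classic multiplicative update for sparse NMF, keeping the
    dictionary D fixed (the coding step used throughout the letter).
    X and D are assumed entry-wise non-negative.
    """
    N, T = D.shape[1], X.shape[1]
    H = np.abs(np.random.default_rng(0).standard_normal((N, T))) + eps
    DtX = D.T @ X
    DtD = D.T @ D
    for _ in range(n_iter):
        # multiplicative update: non-negativity of H is preserved automatically
        H *= DtX / (DtD @ H + lam + eps)
    return H
```

Setting lam to zero reduces the update to plain NMF coding, which matches the remark that NSC is equivalent to NMF when the sparsity constant vanishes.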
resentations of the mixture of speech and noise. The capability
of the features for speech/pause discrimination might be seri- III. SPARSE REPRESENTATION FOR VAD
ously degraded in low signal-to-noise ratio (SNR) conditions.
To mitigate the degradation, we propose a VAD approach via We aim to extract features for speech detection from a noise-
noise reducing using non-negative sparse coding in which fea- reduced representation of original audio signals. Since the mag-
tures for speech detection are extracted from a noise-reduced nitude spectrum of an audio signal is approximately the sum
representation of original audio signals. of speech magnitude spectrum and noise magnitude spec-
trum [11], can be decomposed as
Manuscript received January 19, 2013; revised March 04, 2013; accepted
March 07, 2013. Date of publication March 14, 2013; date of current version
(2)
March 28, 2013. The associate editor coordinating the review of this manuscript
and approving it for publication was Prof. Jeronimo Arenas-Garcia. where and denote the contributions of speech and noise
The authors are with Beijing Lab of Intelligent Information Technology and
the School of Computer Science, Beijing Institute of Technology, Beijing, China in the magnitude spectrum, respectively; denotes a speech
(e-mail: tengpeng@bit.edu.cn; jiayunde@bit.edu.cn). dictionary (with bases) which is over-complete and learned
Digital Object Identifier 10.1109/LSP.2013.2252615 from clean speech signals using NSC, in order to obtain noise-

1070-9908/$31.00 © 2013 IEEE


476 IEEE SIGNAL PROCESSING LETTERS, VOL. 20, NO. 5, MAY 2013

robust bases for representing speech; denotes a noise dic-


tionary (with bases, ) which is low-rank and
learned from noise signal samples using NMF, so that it can well
fit the noise magnitude spectrum with the few bases; and
are the coefficient matrices corresponding to and ,
respectively, and is desired to be sparse. In the ideal case Fig. 1. Graphical model representation of CRF for VAD.
(e.g., the bases from the two dictionaries are distinctly dissim-
ilar), and can reflect the true contributions of the bases where the two kinds of feature functions are transition feature
from the two dictionaries in constructing : functions defined as
(3) (8)

If is discarded, the noise contribution is supposed to be re- and observation feature functions defined as
duced away from . Therefore, under the assumption described
in (3), only is considered in our VAD approach and used as (9)
the noise-reduced representation of speech in , independently
of the noise. Then, feature vectors to describe the speech activity
with parameters . Note that the model with tied
are extracted from . At the -th time frame , we extract a
parameters is used across all cliques, in order to seamlessly
feature vector from consisting of handle sequences of arbitrary length. Given a fully labeled
three statistics of coefficients in , i.e., MAX, mean square training set , the CRF parameters is esti-
root and mean: mated by maximizing the conditional log-likelihood:

(10)
(4)
With a trained CRF, letting denote for simplicity,
the best activity label sequence conditioned on a given fea-
for measuring the presence of speech. ture vector sequence is estimated by decoding the CRF, i.e.,
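To make (4) concrete, a minimal sketch of the per-frame feature computation is given below; the stacking order of the three statistics inside the feature vector is an assumption, since the letter does not state it.

```python
import numpy as np

def vad_features(Hs):
    """Per-frame VAD features from the speech coefficient matrix Hs (N_s x T).

    Returns a 3 x T array holding, for every frame, the maximum, the
    root mean square, and the mean of the speech coefficients, as in (4).
    """
    mx = Hs.max(axis=0)
    rms = np.sqrt((Hs ** 2).mean(axis=0))
    mean = Hs.mean(axis=0)
    return np.vstack([mx, rms, mean])
```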
IV. VAD CONTEXT MODELING BASED ON CRF

The goal of a VAD task is to give a sequence of voice activity labels $\mathbf{y} = [y_1, \ldots, y_T]$ along a given audio signal, where $y_t \in \{0, 1\}$ indicates the speech absence or presence at the $t$-th time frame. Let $\mathbf{o}_t$ be an observed feature vector derived from $\hat{\mathbf{h}}_t^s$, and correspondingly let $\mathbf{O} = [\mathbf{o}_1, \ldots, \mathbf{o}_T]$ be an observed feature vector sequence along the signal. We model the correlation between $\mathbf{y}$ and $\mathbf{O}$ using CRFs with a linear chain structure, i.e., first-order state dependency depicted in Fig. 1. In the linear chain, the cliques include pairs of neighboring labels $(y_{t-1}, y_t)$ and feature-label pairs $(y_t, \mathbf{o}_t)$.

Fig. 1. Graphical model representation of CRF for VAD.

Let exponentiated feature functions be the positive-valued potential functions of the cliques. Given an observation $\mathbf{O}$ with $T$ time frames and parameters $\boldsymbol{\theta}$, the distribution over a label sequence $\mathbf{y}$ can be defined as

$$p(\mathbf{y} \mid \mathbf{O}; \boldsymbol{\theta}) = \frac{1}{Z(\mathbf{O})} \exp\{\Psi(\mathbf{y}, \mathbf{O}; \boldsymbol{\theta})\} \qquad (5)$$

$$Z(\mathbf{O}) = \sum_{\mathbf{y}'} \exp\{\Psi(\mathbf{y}', \mathbf{O}; \boldsymbol{\theta})\} \qquad (6)$$

where $Z(\mathbf{O})$ is the observation-dependent normalization. In our approach, $\Psi(\mathbf{y}, \mathbf{O}; \boldsymbol{\theta})$ is computed, in terms of weighted sums over the features of the cliques, by

$$\Psi(\mathbf{y}, \mathbf{O}; \boldsymbol{\theta}) = \sum_{t} \sum_{i,j} \lambda_{ij}\, f_{ij}(y_{t-1}, y_t) + \sum_{t} \sum_{i} \boldsymbol{\mu}_i^{\mathsf{T}} \mathbf{g}_i(y_t, \mathbf{o}_t) \qquad (7)$$

where the two kinds of feature functions are transition feature functions defined as

$$f_{ij}(y_{t-1}, y_t) = \delta(y_{t-1} = i)\,\delta(y_t = j) \qquad (8)$$

and observation feature functions defined as

$$\mathbf{g}_i(y_t, \mathbf{o}_t) = \delta(y_t = i)\,\mathbf{o}_t \qquad (9)$$

with parameters $\boldsymbol{\theta} = \{\lambda_{ij}, \boldsymbol{\mu}_i\}$. Note that the model with tied parameters is used across all cliques, in order to seamlessly handle sequences of arbitrary length.
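A minimal sketch of the clique scoring in (7)-(9) for the two-state chain follows. The parameter layout, a 2 x 2 matrix of transition weights and one weight vector per state for the observation features, is our reading of (8) and (9) with tied parameters; it is not a layout spelled out in the letter.

```python
import numpy as np

def observation_scores(O, obs_w):
    """Observation terms of Psi: score[t, y] = mu_y . o_t, as in (9).

    O     : T x d array of per-frame feature vectors o_t.
    obs_w : 2 x d array of observation weights mu_0, mu_1.
    """
    return O @ obs_w.T  # T x 2

def psi(y, O, trans_w, obs_w):
    """Evaluate the clique sum Psi(y, O; theta) of (7) for one label sequence y.

    trans_w : 2 x 2 array of transition weights lambda_ij, as in (8).
    """
    y = np.asarray(y)
    obs_score = observation_scores(O, obs_w)
    score = obs_score[np.arange(len(y)), y].sum()  # feature-label cliques
    score += trans_w[y[:-1], y[1:]].sum()          # neighboring-label cliques
    return score
```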
Given a fully labeled training set $\{(\mathbf{O}^{(m)}, \mathbf{y}^{(m)})\}_{m=1}^{M}$, the CRF parameters $\boldsymbol{\theta}$ are estimated by maximizing the conditional log-likelihood:

$$\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \sum_{m=1}^{M} \log p(\mathbf{y}^{(m)} \mid \mathbf{O}^{(m)}; \boldsymbol{\theta}) \qquad (10)$$

With a trained CRF, letting $\hat{\boldsymbol{\theta}}$ denote the estimated parameters for simplicity, the best activity label sequence conditioned on a given feature vector sequence is estimated by decoding the CRF, i.e., solving

$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{O}; \hat{\boldsymbol{\theta}}) \qquad (11)$$

The decoding is usually achieved using the Viterbi algorithm, obtaining a hard decision of $\hat{\mathbf{y}}$ (as shown in [5]). However, we employ the Forward-Backward algorithm to calculate the marginal posterior $p_t = p(y_t = 1 \mid \mathbf{o}_{t-w}, \ldots, \mathbf{o}_{t+w}; \hat{\boldsymbol{\theta}})$, serving as the a posteriori probability of speech presence at the $t$-th time frame, where $w$ controls the range of the context that is considered. Then, we actually obtain an activity label sequence determined by a threshold $\eta$:

$$\hat{y}_t = \begin{cases} 1, & p_t \ge \eta \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$

so that a trade-off between detection probability and false alarm probability of VAD can be easily made by tuning $\eta$.
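The soft decision in (12) only requires the marginal posterior of speech presence at each frame. The sketch below computes it with a standard log-domain forward-backward pass over the potentials from the earlier sketch of (7)-(9) and then applies the threshold; restricting the computation to a context window of width w around each frame, as described in the text, is omitted here for brevity.

```python
import numpy as np
from scipy.special import logsumexp

def speech_posteriors(obs_score, trans_w):
    """Marginal posteriors p(y_t = 1 | O) of a two-state linear-chain CRF.

    obs_score : T x 2 observation log-potentials (see previous sketch).
    trans_w   : 2 x 2 transition log-potentials.
    """
    T = obs_score.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.zeros((T, 2))
    alpha[0] = obs_score[0]
    for t in range(1, T):  # forward pass
        alpha[t] = obs_score[t] + logsumexp(alpha[t - 1][:, None] + trans_w, axis=0)
    for t in range(T - 2, -1, -1):  # backward pass
        beta[t] = logsumexp(trans_w + (obs_score[t + 1] + beta[t + 1])[None, :], axis=1)
    log_marg = alpha + beta
    log_marg -= logsumexp(log_marg, axis=1, keepdims=True)  # normalize per frame
    return np.exp(log_marg[:, 1])

def hard_labels(posteriors, eta=0.5):
    """Threshold the posteriors as in (12); tuning eta trades detection
    probability against false-alarm probability."""
    return (posteriors >= eta).astype(int)
```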
V. EXPERIMENTS

The TIMIT [12] corpus with its word transcription is used in the experiments for the VAD performance evaluation. Four typical noise sources from the NOISEX-92 [13] corpus, including the F-16, factory, white and babble noises, are selected for the simulation of real noisy environments. We randomly select 128 sentences, of which 8 sentences (excluding the two dialect sentences) were spoken by each of 16 speakers from the TIMIT TEST set. 64 sentences from half of the speakers are concatenated as a long utterance with silence of random length (from 1 to 3 seconds) inserted between every pair of adjacent sentences, and the remaining 64 sentences are concatenated in the same way. These two long utterances are about 338 seconds and 331 seconds long, with 51.6% and 50.3% of speech signals, respectively. The first long utterance is mixed with white noise at a very low SNR and serves as the only training utterance of the CRF. The second is mixed with various types of noise for the evaluation of VAD performance. The noise adding and SNR estimation are implemented using FaNT [14] with its default P.341 filter for 16 kHz sampled data, in order to simulate audio signals recorded in real noise environments.

An input audio signal is down-sampled to 8 kHz, and a 256-point Short Time Fourier Transform (STFT) is performed with an analysis window of length 32 ms and a window shift of 16 ms. The magnitude spectrum $\mathbf{X}$ consists of the squared magnitude values of the low 128 dimensions of the STFT output. The speech dictionary $\mathbf{D}_s$ is learned, with a fixed number of bases $N_s$ and sparsity weight $\lambda$, from 3696 sentences, of which 8 sentences were spoken by each of 462 speakers from the TIMIT TRAIN set. For a noisy utterance, assuming speech is not present at its very beginning, $\mathbf{D}_n$ is initially learned, with a small number of bases $N_n$, from the first 1024 ms of the signal, and then re-learned from the latest speech residual (its negative elements forced to zero) in every subsequent 1024 ms. $\hat{\mathbf{H}}_s$ is estimated on $\mathbf{D}_s$ and $\mathbf{D}_n$. The feature sequence $\mathbf{O}$ extracted from $\hat{\mathbf{H}}_s$ is used as the input of the CRF. The posterior $p_t$ is calculated with a fixed context range $w$, and $\hat{y}_t$ is determined by a given threshold $\eta$. We apply an optional smoothing post-processing with the constraint that the detected speech presence/absence durations are longer than 16 time frames.
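As an illustration of this front-end (8 kHz signal, 256-point STFT, 32 ms analysis window, 16 ms shift, squared magnitudes of the low 128 bins), the following sketch builds the spectral matrix handed to the decomposition. The window type is not stated in the letter, so the Hann window here is an assumption.

```python
import numpy as np

def spectral_matrix(signal, fs=8000, n_fft=256, win_ms=32, shift_ms=16, n_bins=128):
    """F x T spectral matrix X used as input to the dictionary decomposition.

    32 ms frames with a 16 ms shift are windowed, a 256-point FFT is taken,
    and the squared magnitudes of the lowest 128 bins are kept.
    """
    win_len = fs * win_ms // 1000           # 256 samples at 8 kHz
    shift = fs * shift_ms // 1000           # 128 samples at 8 kHz
    window = np.hanning(win_len)            # window type assumed, not stated in the letter
    n_frames = 1 + (len(signal) - win_len) // shift
    X = np.empty((n_bins, n_frames))
    for t in range(n_frames):
        frame = signal[t * shift: t * shift + win_len] * window
        spec = np.fft.rfft(frame, n_fft)
        X[:, t] = np.abs(spec[:n_bins]) ** 2  # "low 128 dimensions", squared magnitude
    return X
```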
$\hat{\mathbf{H}}_s$ can be estimated by decomposing $\mathbf{X}$ directly on a concatenated dictionary $[\mathbf{D}_s\ \mathbf{D}_n]$ using NSC as

$$\begin{bmatrix}\hat{\mathbf{H}}_s \\ \hat{\mathbf{H}}_n\end{bmatrix} = \arg\min_{\mathbf{H}_s, \mathbf{H}_n \ge 0} \left\|\mathbf{X} - [\mathbf{D}_s\ \mathbf{D}_n]\begin{bmatrix}\mathbf{H}_s \\ \mathbf{H}_n\end{bmatrix}\right\|_F^2 + \lambda \sum_{n,t} h_{nt} \qquad (13)$$

Here $\mathbf{H}_n$ is also desired to be sparse like $\mathbf{H}_s$, so the sparseness regularization over all coefficients is still reasonable. We refer to this method as Method-1. In the case of heavy noise, $\hat{\mathbf{H}}_s$ can be estimated in a two-step method (called Method-2) as follows. Firstly, $\mathbf{X}$ is decomposed on the noise dictionary $\mathbf{D}_n$ using NMF:

$$\hat{\mathbf{H}}_n = \arg\min_{\mathbf{H}_n \ge 0} \|\mathbf{X} - \mathbf{D}_n\mathbf{H}_n\|_F^2 \qquad (14)$$

Then, $\hat{\mathbf{H}}_s$ is estimated by decomposing the residual $\mathbf{E} = \mathbf{X} - \mathbf{D}_n\hat{\mathbf{H}}_n$ (the negative elements of $\mathbf{E}$ are forced to zero) on the speech dictionary $\mathbf{D}_s$ using NSC as

$$\hat{\mathbf{H}}_s = \arg\min_{\mathbf{H}_s \ge 0} \|\mathbf{E} - \mathbf{D}_s\mathbf{H}_s\|_F^2 + \lambda \sum_{n,t} h_{nt}^s \qquad (15)$$

Method-2 can achieve better noise reduction as it gives priority to fitting the noise. However, it may over-reduce the noise when the speech power is low. We compare $\hat{\mathbf{H}}_s$ estimated by the two methods on their noise-robustness, shown by their respective feature vectors $\mathbf{o}_t$. In addition, as a reference, $\hat{\mathbf{H}}_s$ is also estimated by decomposing $\mathbf{X}$ only on the speech dictionary $\mathbf{D}_s$ using NSC:

$$\hat{\mathbf{H}}_s = \arg\min_{\mathbf{H}_s \ge 0} \|\mathbf{X} - \mathbf{D}_s\mathbf{H}_s\|_F^2 + \lambda \sum_{n,t} h_{nt}^s \qquad (16)$$

named the simple NSC method. The comparison is made on the factory noise utterance and illustrated in Fig. 2.

Fig. 2. Noise-robustness comparison of $\hat{\mathbf{H}}_s$ estimated by different methods. (a) Magnitude spectrum of a clean utterance with voice activity labels. (b) Magnitude spectrum with factory noise added at low SNR. (c) Normalized feature values in the simple NSC method. (d) Normalized feature values in Method-1. (e) Normalized feature values in Method-2.

Note that, compared with the simple NSC method, Method-1 and Method-2 succeed in reducing the heavy noise around 16.8 sec and 21.3 sec, respectively. However, Method-2 over-reduced the noise around 19.5 sec.
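A compact sketch of the two-step Method-2 follows, assuming the nsc_encode helper from the Section II sketch is in scope: the noise coefficients are fitted first without the sparseness term (plain NMF coding on the noise dictionary, as in (14)), the residual is clipped at zero, and the speech coefficients are then estimated on the speech dictionary with the sparseness term, as in (15). The helper name and the value of the sparsity weight are illustrative.

```python
import numpy as np

def method2(X, Ds, Dn, lam=0.1):
    """Two-step estimate of the speech coefficients H_s (Method-2).

    Step 1: fit the noise part X ~ Dn @ Hn by plain NMF coding (lam = 0), as in (14).
    Step 2: decompose the clipped residual on the speech dictionary with
            the sparseness term, as in (15).
    Requires nsc_encode() from the earlier sketch.
    """
    Hn = nsc_encode(X, Dn, lam=0.0)        # noise is fitted first (gets priority)
    E = np.maximum(X - Dn @ Hn, 0.0)       # negative residual elements forced to zero
    Hs = nsc_encode(E, Ds, lam=lam)        # sparse speech coefficients
    return Hs
```

Method-1 corresponds to a single nsc_encode call on the horizontally concatenated dictionary np.hstack([Ds, Dn]), with the top N_s rows of the result taken as H_s.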
For training the CRF independently of the noise, $\hat{\mathbf{H}}_s$ of the training utterance is estimated by using the simple NSC method according to (16). Then, the feature vector sequence extracted from the resulting $\hat{\mathbf{H}}_s$, together with its transcription, is used in the training.

The detection probability and false alarm probability of our VAD approach (estimating $\hat{\mathbf{H}}_s$ using Method-1 or Method-2, with or without smoothing) are used to evaluate the performance. Sohn's VAD [1], G.729B and our approach using the simple NSC method for estimating $\hat{\mathbf{H}}_s$ are taken as three references. The ROC curves of these approaches under different noises in the low-SNR condition are illustrated in Fig. 3, where "Ours(I)" and "Ours(II)" denote our approach using Method-1 and Method-2, respectively. We also compare the ROC curves of these VAD approaches in high- and medium-SNR conditions (20 dB and 5 dB), as illustrated in Fig. 4.

Fig. 3. ROC curves of the VAD approaches under (a) F-16, (b) factory, (c) white and (d) babble noises in the low-SNR condition, where "Ours(I)", "Ours(II)" and "Simple NSC" denote our VAD approach using Method-1, Method-2 and the simple NSC method for estimating $\hat{\mathbf{H}}_s$, respectively.

Fig. 4. ROC curves of the VAD approaches under F-16 noise in high- and medium-SNR (20 dB and 5 dB, respectively) conditions.

The results demonstrate that our approach further improves the VAD performance in low SNR conditions, and is effective in high- and medium-SNR conditions. Comparatively speaking, our approach is not suitable for dealing with the babble noise, because the bases in the babble dictionary are similar to those in the speech dictionary, as they are all derived from human speech. In white noise, the simple NSC method for estimating $\hat{\mathbf{H}}_s$ is a little better than ours. The reason is that reducing white noise is an inherent strength of sparse coding, and the simple NSC method estimates $\hat{\mathbf{H}}_s$ in a direct way, avoiding the possible confusion between the bases from the speech and noise dictionaries. Our Method-2 performs better than Method-1 in the medium- and low-SNR F-16 noise conditions, but worse in the high-SNR condition due to over-noise-reduction. In high-SNR conditions, the performance of our approach is degraded by the high possibility of over-noise-reduction, which arises from both the inaccurate estimation of $\hat{\mathbf{H}}_s$ and the noise-insensitivity caused by the very-low-SNR training utterance for the CRF.
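The two quantities plotted on the ROC curves can be computed per frame as sketched below; this frame-level definition (recall on speech frames versus false alarms on non-speech frames) is a common convention that we assume, since the letter does not spell out the exact computation.

```python
import numpy as np

def roc_point(decisions, labels):
    """Frame-level detection and false-alarm probabilities.

    decisions, labels : arrays of 0/1 per frame (1 = speech).
    Returns (P_d, P_fa): the fraction of speech frames detected and the
    fraction of non-speech frames falsely flagged as speech.
    """
    decisions = np.asarray(decisions)
    labels = np.asarray(labels)
    p_d = decisions[labels == 1].mean()
    p_fa = decisions[labels == 0].mean()
    return p_d, p_fa
```

Sweeping the threshold $\eta$ in (12) and plotting the resulting (P_fa, P_d) pairs traces one ROC curve per method.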
VI. CONCLUSION

We have proposed a VAD approach using NSC to improve the detection performance by using a noise-reduced representation for feature extraction. We have decomposed the magnitude spectrum of an audio signal into coefficients on a clean speech dictionary and a noise dictionary. Only the coefficients corresponding to the speech bases have been considered and used as the noise-reduced representation. A CRF with a linear chain structure has been constructed to model the correlation between activity label sequences and feature sequences along audio signals, and it has been decoded in a soft-decision fashion. The experimental results have demonstrated that our approach further improves the performance of VAD in low SNR conditions.

ACKNOWLEDGMENT

The authors thank Prof. Xiangjian He for his English corrections to this manuscript.

REFERENCES

[1] J. Sohn, N. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1–3, 1999.
[2] Y. Cho, K. Al-Naimi, and A. Kondoz, “Improved voice activity detection based on a smoothed statistical likelihood ratio,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2001, vol. 2, pp. 737–740.
[3] J. Ramírez, J. Segura, J. Górriz, and L. García, “Improved voice activity detection using contextual multiple hypothesis testing for robust speech recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 8, pp. 2177–2189, 2007.
[4] Y. Suh and H. Kim, “Multiple acoustic model-based discriminative likelihood ratio weighting for voice activity detection,” IEEE Signal Process. Lett., vol. 19, no. 8, pp. 507–510, 2012.
[5] A. Saito, Y. Nankaku, A. Lee, and K. Tokuda, “Voice activity detection based on conditional random fields using multiple features,” in Proc. Interspeech, 2010, pp. 2086–2089.
[6] J. Wu and X. Zhang, “Efficient multiple kernel support vector machine based voice activity detection,” IEEE Signal Process. Lett., vol. 18, no. 8, pp. 466–469, 2011.
[7] D. You, J. Han, G. Zheng, and T. Zheng, “Sparse power spectrum based robust voice activity detector,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2012, pp. 289–292.
[8] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. Int. Conf. Machine Learning, 2001, pp. 282–289.
[9] P. Hoyer, “Non-negative sparse coding,” in Proc. IEEE Workshop on Neural Networks for Signal Processing, 2002, pp. 557–565.
[10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
[11] K. W. Wilson, B. Raj, P. Smaragdis, and A. Divakaran, “Speech denoising using nonnegative matrix factorization with priors,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 2008, pp. 4029–4032.
[12] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM,” NIST, 1993.
[13] Rice University, NOISEX-92 Database [Online]. Available: http://spib.rice.edu/spib/select_noise.html
[14] H.-G. Hirsch, FaNT: Filtering and Noise Adding Tool [Online]. Available: http://aurora.hsnr.de
