
Empirical Mode Decomposition VAD based on Multiple Sensor LRT

Theodoros Petsatodis #1, Fotios Talantzis #2, Christos Boukis 3

# Athens Information Technology
0.8 km Markopoulou Ave., Peania, Athens, Greece
1 thp@es.aau.dk
2 fota@ait.gr

Accenture Interactive
Athens 14564, Greece
3 christos.boukis@accenture.com

Abstract: Voice Activity Detection (VAD) remains a challenging task given its dependence on adverse noise and reverberation conditions. The problem becomes even more difficult when the microphones used to detect speech reside far from the speaker. In this paper, an unsupervised VAD scheme is presented, based on the Empirical Mode Decomposition (EMD) analysis framework and a multiple-input likelihood ratio test (LRT). The highly efficient method of EMD relies on the local characteristic time scale of the data to analyse and decompose non-stationary signals into a set of so-called intrinsic mode functions (IMF). These functions are injected into the multiple-input LRT scheme in order to decide upon speech presence or absence. To minimize mis-detections and enhance the performance of the hypothesis test, a computationally efficient forgetting scheme along with an adaptive threshold are also employed. Simulations, conducted in several artificial environments, illustrate that significant performance improvements can be expected from the proposed scheme when compared to similar VAD systems.

I. INTRODUCTION
Voice Activity Detection (VAD) is a core speech processing technology with applications in several domains. Integrated in several telecommunication systems, it is used to reduce the power consumption of transmitters and bandwidth utilization [1]. VAD is often merged with other speech-processing systems, such as Automatic Speech Recognition and Speaker Identification, to prevent their operation in the absence of speech [2].
Typically, VAD systems rely on the continuous observation of a specific metric to decide on the content of audio signals. Such metrics can be energy levels, zero-crossing rate, periodicity, linear prediction coding parameters, and mutual information [1], [3], [4], [5]. Recently introduced statistical VADs attempt to formulate the problem mathematically, by employing a Likelihood Ratio Test (LRT) as a decision criterion on top of the Fourier-processed framed input [6], [7], [8].
The performance of VAD systems depends on various factors, including the discriminative ability of the classification criterion employed, the dynamics of the additive noise, speaker movement, the signal-to-noise ratio and reverberation. The task of VAD becomes more difficult when far-field microphones are used to capture voice. Furthermore, speech generation is a non-linear and non-stationary process. Especially during fast transitions between phonemes and voicing states, it can be considered highly non-stationary. Thus, its analysis, when employing methods such as Fourier analysis, is conducted under specific assumptions of linearity and stationarity that can potentially lead to reduced performance. The stationarity requirement is not particular to Fourier analysis, as it applies to most of the available data analysis methods.
Towards overcoming such adversities, related research has focused on a very powerful analysis framework, namely Empirical Mode Decomposition (EMD) [9]. This highly efficient method relies on the local characteristic time scale of the data to analyse and decompose non-stationary signals into a set of so-called intrinsic mode functions (IMF).
Based on EMD analysis, a VAD system has been presented in [10]. By extracting entropy-based features from the resulting IMFs, the experiments showed that the method was superior to entropy extracted from the original speech, especially under intense noise. In [11] the signal was first decomposed employing EMD, and the results were then processed by the Hilbert transform (HT) to obtain the instantaneous frequency. The noise threshold was estimated by analysing the front of the signal's Hilbert amplitude spectrum. Speech and non-speech segments were then distinguished by this threshold and the whole signal's Hilbert amplitude spectrum, resulting in very good performance under low SNR.
In this paper, an alternative VAD algorithm based on EMD is considered. In order to decide upon speech presence or absence, the multiple-microphone LRT VAD presented in [12] is employed, driven by the IMFs that emerge from the decomposition process. In order to improve results, an adaptive threshold along with a computationally efficient smoothing scheme are also used in the system.
The paper is organized as follows. In Section II EMD analysis is summarized. In Section III the proposed VAD is described in detail. Section IV presents the experiments performed. Section V concludes the work.
II. EMPIRICAL MODE DECOMPOSITION
Through EMD any complicated data set can be decomposed into a finite and often small number of intrinsic mode functions that admit a well-behaved HT. The method is adaptive and, therefore, highly efficient. Furthermore, being based on the local characteristic time scale of the data, it is applicable to non-linear and non-stationary signals [9] such as speech.
Assuming a single speaker, the speech signal captured by a distant microphone at time t is given by

x(t) = h(t) * s(t) + n(t)   (1)

where s(t) denotes the source speech signal at time t, h(t) the corresponding acoustic impulse response, n(t) the additive noise, and * denotes convolution.
Given x(t), all maxima are identified and then interpolated using a cubic spline curve to define the upper envelope of the signal. In the same way, the lower envelope is defined for the minima. The mean value function of the upper and lower envelopes is defined as m_1(t), and the first signal component can then be calculated as

h_1(t) = x(t) - m_1(t).   (2)

The sifting procedure shown above has to be repeated several times in order to better approximate the first IMF. In the second sifting process, h_1 is treated as the data and thus

h_1(t) - m_{11}(t) = h_{11}(t).   (3)

The procedure is repeated d times, until h_{1d} is an IMF:

h_{1(d-1)}(t) - m_{1d}(t) = h_{1d}(t).   (4)

It is then designated as c_1(t) = h_{1d}(t), the first IMF component of the data. To guarantee that the IMF components retain enough physical sense of both amplitude and frequency modulations, the size of the standard deviation SD, computed from two consecutive sifting results, is limited in order to stop the sifting process:

SD = \sum_{t=0}^{T} \frac{|h_{1(d-1)}(t) - h_{1d}(t)|^2}{h_{1(d-1)}^2(t)}.   (5)

A typical value for SD can be set between 0.2 and 0.3. Overall, c_1 should contain the finest scale or the shortest-period component of the signal. The separation of c_1 from the rest of the data is given by

x(t) - c_1(t) = r_1(t).   (6)

Since the residue r_1 still contains information on longer-period components, it is treated as the new data and subjected to the same sifting process as described above:

r_1(t) - c_2(t) = r_2(t), ..., r_{I-1}(t) - c_I(t) = r_I(t).   (7)

The sifting process can be stopped either when the component c_i or the residue r_I becomes so small that it is less than a predetermined value of substantial consequence, or when the residue r_I becomes a monotonic function from which no more IMFs can be extracted. Finally, the data, decomposed into I empirical modes and a residue r_I that can be either the mean trend or a constant, can be recomposed into the initial signal by

x(t) = \sum_{i=1}^{I} c_i(t) + r_I(t)   (8)

where I is the total number of IMFs that emerged from the decomposition process.
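For illustration, the sketch below outlines the sifting loop of eqs. (2)-(8) in Python. It is a minimal reading of the procedure rather than the implementation used here, and it assumes SciPy's CubicSpline and argrelextrema for the envelope construction; the helper names sift_once and emd are ours.

    import numpy as np
    from scipy.interpolate import CubicSpline
    from scipy.signal import argrelextrema

    def sift_once(x):
        """One sifting step: subtract the mean of the upper and lower cubic-spline envelopes."""
        t = np.arange(len(x))
        maxima = argrelextrema(x, np.greater)[0]
        minima = argrelextrema(x, np.less)[0]
        if len(maxima) < 2 or len(minima) < 2:
            return None                              # too few extrema: treat x as monotonic
        upper = CubicSpline(maxima, x[maxima])(t)    # upper envelope through the maxima
        lower = CubicSpline(minima, x[minima])(t)    # lower envelope through the minima
        return x - 0.5 * (upper + lower)             # h_1(t) = x(t) - m_1(t), eq. (2)

    def emd(x, sd_thresh=0.25, max_imfs=10):
        """Decompose x into IMFs c_i and a residue r_I, cf. eq. (8)."""
        residue, imfs = np.asarray(x, dtype=float), []
        for _ in range(max_imfs):
            h_prev = residue
            while True:                              # repeated sifting, eqs. (3)-(4)
                h = sift_once(h_prev)
                if h is None:                        # residue became monotonic: stop
                    return imfs, h_prev
                sd = np.sum((h_prev - h) ** 2 / (h_prev ** 2 + 1e-12))
                h_prev = h
                if sd < sd_thresh:                   # stopping criterion of eq. (5)
                    break
            imfs.append(h_prev)                      # c_i(t)
            residue = residue - h_prev               # r_i(t), eqs. (6)-(7)
        return imfs, residue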
III. MERGING EMD WITH MULTIPLE MICROPHONE VAD


The multiple-microphone VAD system developed in [12] serves as the platform to merge VAD and EMD, although for the scope of this work the multiple microphone signals are substituted by the corresponding I IMFs of (8). In essence, the signal x(t) is first decomposed with EMD into a set of IMF signals c_i(t) that are treated as additional recordings of a microphone array. The trend r_I is not included in the process.
Following [12], VAD is expressed as the likelihood ratio of two hypotheses stating speech presence and absence for each IMF c_i(t). Assuming additive noise, the two hypotheses H_{1,i} and H_{0,i} that indicate speech presence and speech absence are accordingly:

H_{0,i}: speech absence: X_i(t) = N_i(t)   (9)
H_{1,i}: speech presence: X_i(t) = S_i(t) + N_i(t)   (10)

where X_i(t) = [X_{0,i}(t), X_{1,i}(t), ..., X_{K-1,i}(t)]^T, S_i(t) = [S_{0,i}(t), S_{1,i}(t), ..., S_{K-1,i}(t)]^T and N_i(t) = [N_{0,i}(t), N_{1,i}(t), ..., N_{K-1,i}(t)]^T are the noisy captured speech, reverberated speech, and noise frequency components for the i-th IMF c_i(t), with K the total number of frequency bins.
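As a sketch of this substitution, the IMFs can be framed and Fourier transformed so that each one plays the role of an array channel, using the frame settings of Section IV and the emd helper sketched in Section II; stft_frames is an assumed helper and x denotes the captured signal of eq. (1), neither being part of the original system.

    import numpy as np

    def stft_frames(signal, frame_len, hop):
        """Hann-windowed short-time FFT; returns |X_k|^2 per frame (rows) and frequency bin (columns)."""
        win = np.hanning(frame_len)
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[n * hop:n * hop + frame_len] * win for n in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1)) ** 2

    fs = 8000
    frame_len, hop = int(0.040 * fs), int(0.010 * fs)   # 40 ms frames, 10 ms step (Section IV)
    imfs, trend = emd(x)                                # x(t) as in eq. (1); the trend r_I is discarded
    X2 = np.stack([stft_frames(c, frame_len, hop) for c in imfs])   # shape (I, frames, K)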
Real and imaginary parts of the noise and speech frequency spectra are assumed to be zero-mean Gaussian distributed for every IMF. The probability densities of the noise and speech components, with k denoting the frequency bin, are given by

f_{n,i}(N_{k,i}(t)) = (2\pi\sigma_{n,k,i}^2)^{-1/2} e^{-N_{k,i}(t)^2 / (2\sigma_{n,k,i}^2)}   (11)

f_{s,i}(S_{k,i}(t)) = (2\pi\sigma_{s,k,i}^2)^{-1/2} e^{-S_{k,i}(t)^2 / (2\sigma_{s,k,i}^2)}   (12)

where \lambda_{n,k,i} = 2\sigma_{n,k,i}^2 and \lambda_{s,k,i} = 2\sigma_{s,k,i}^2 are the slowly varying variances of the Gaussian distributed noise and speech respectively, estimated by employing eq. (16) for the k-th frequency component of the i-th IMF. The probability density functions conditioned on H_{0,i} and H_{1,i} are given by

p(X_i(t)|H_{0,i}) = \prod_{k=0}^{K-1} \frac{1}{\pi\lambda_{n,k,i}} e^{-|X_{k,i}|^2 / \lambda_{n,k,i}}   (13)

p(X_i(t)|H_{1,i}) = \prod_{k=0}^{K-1} \frac{1}{\pi[\lambda_{n,k,i} + \lambda_{s,k,i}]} e^{-|X_{k,i}|^2 / (\lambda_{n,k,i} + \lambda_{s,k,i})}.   (14)
In the case of a single-microphone VAD scheme, the likelihood ratio for the k-th frequency bin of the i-th IMF is defined as

\Lambda_{k,i} = \frac{p(X_{k,i}|H_{1,i})}{p(X_{k,i}|H_{0,i})} = \frac{1}{1 + \xi_{k,i}} e^{\frac{\gamma_{k,i}\xi_{k,i}}{1 + \xi_{k,i}}}   (15)

where \xi_{k,i} = \lambda_{s,k,i}/\lambda_{n,k,i} and \gamma_{k,i} = |X_{k,i}|^2/\lambda_{n,k,i} are the a priori and a posteriori signal-to-noise ratios, estimated by employing the Predicted Estimation (PD) method [6]

\hat{\lambda}_{n,k,i}(t+1) = \alpha_n \hat{\lambda}_{n,k,i}(t) + (1 - \alpha_n) E[|N_{k,i}(t)|^2 | X_{k,i}(t)]
\hat{\lambda}_{s,k,i}(t+1) = \alpha_s \hat{\lambda}_{s,k,i}(t) + (1 - \alpha_s) E[|S_{k,i}(t)|^2 | X_{k,i}(t)]   (16)

where \hat{\lambda}_{s,k,i}(t), \hat{\lambda}_{n,k,i}(t) are estimates of \lambda_{s,k,i}(t), \lambda_{n,k,i}(t) and \alpha_n, \alpha_s are smoothing parameters, both set to 0.99.
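A compact sketch of eqs. (15)-(16) is given below. The conditional expectations of the predicted-estimation method in [6] are replaced here by simple plug-in estimates, so the update rule is only indicative; the variable names are ours.

    import numpy as np

    ALPHA_N = ALPHA_S = 0.99                             # smoothing parameters of eq. (16)

    def update_variances(lmbd_n, lmbd_s, X2, speech_absent):
        """One recursion of eq. (16) per bin; E[.|X] is approximated by plug-in power estimates."""
        n_target = X2 if speech_absent else lmbd_n       # refresh noise power only in non-speech frames
        s_target = np.maximum(X2 - lmbd_n, 0.0)          # crude estimate of the clean-speech power
        lmbd_n = ALPHA_N * lmbd_n + (1.0 - ALPHA_N) * n_target
        lmbd_s = ALPHA_S * lmbd_s + (1.0 - ALPHA_S) * s_target
        return lmbd_n, lmbd_s

    def likelihood_ratio(X2, lmbd_n, lmbd_s, eps=1e-12):
        """Eq. (15) for every bin k of one IMF i, plus the a posteriori SNR gamma."""
        xi = lmbd_s / (lmbd_n + eps)                     # a priori SNR
        gamma = X2 / (lmbd_n + eps)                      # a posteriori SNR
        return np.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi), gamma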
The decision is drawn through the geometric mean of the likelihood ratios over the individual frequencies of every IMF,

\log \Lambda_i = \frac{1}{K} \sum_{k=0}^{K-1} \{\gamma_{k,i} - \log(\gamma_{k,i}) - 1\}.   (17)

Thus, the LRT across all IMF components is transformed to

\Lambda_{\log}^{EMD} = \frac{1}{IK} \sum_{i=1}^{I} \sum_{k=0}^{K-1} \{\gamma_{k,i} - \log(\gamma_{k,i}) - 1\} \gtrless_{H_0}^{H_1} \eta   (18)

where \eta denotes the threshold of decision presented in [7].
In order to enhance the performance of the hypothesis test, the following forgetting scheme is employed:

\bar{\Lambda}_{\log}(t) = (1 - \alpha_{\log}^{EMD}) \bar{\Lambda}_{\log}(t-1) + \alpha_{\log}^{EMD} \Lambda_{\log}^{EMD}(t)   (19)

where \alpha_{\log}^{EMD} = 0.9 is the smoothing factor and \bar{\Lambda}_{\log}(t) the smoothed likelihood.

Fig. 1. Pe performance under different intensities of white noise (Pe (%) versus SNR (dB); curves: EMD MM-LRT, SM-LRT, EMD+HHT, EMD+SpEnt).
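The pooled statistic of eqs. (17)-(19) then reduces to a few lines; the sketch below uses a fixed placeholder for the adaptive threshold eta of [7], which is not specified in this paper, and the function names are ours.

    import numpy as np

    ALPHA_LOG = 0.9                            # forgetting factor of eq. (19)

    def frame_statistic(gamma):
        """Eqs. (17)-(18): average of gamma - log(gamma) - 1 over all K bins and I IMFs.
        gamma: array of shape (I, K) holding the a posteriori SNRs of the current frame."""
        g = np.maximum(gamma, 1e-12)
        return np.mean(g - np.log(g) - 1.0)

    def vad_decision(stat, prev_smoothed, eta=0.15):
        """Eq. (19) followed by thresholding; eta stands in for the adaptive threshold of [7]."""
        smoothed = (1.0 - ALPHA_LOG) * prev_smoothed + ALPHA_LOG * stat
        return smoothed > eta, smoothed        # True -> H1 (speech present)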

IV. PERFORMANCE DISCUSSION

Fig. 2. Pe performance under different intensities of babble noise (Pe (%) versus SNR (dB); curves: EMD MM-LRT, SM-LRT, EMD+HHT, EMD+SpEnt).

The Speech Detection (Pc), Non-speech Detection (Pf) and Average Detection (Pe) error rates [12] were evaluated using speech recordings performed in the anechoic chamber of Aalborg University, Denmark, with a close-talking microphone [7]. Thirteen participants (7 male and 6 female) were recorded at 16 kHz, each speaking in their mother tongue for approximately 15 min, reading sentences and words presented to them with random pause intervals. Eight different languages appear in the data set. Recordings were also performed in English for 15 additional minutes following the same pattern. Speech intervals occupy half of the recording time. The recordings have been annotated manually.
The speech data were artificially contaminated with white, vehicular and babble noises from the NOISEX-92 database [13]. The microphone array data were artificially generated using the Image Method [14] for a reverberation time of T60 = 0.15 s and room dimensions of [4.4, 5.8, 2.6] m. The speaker was located 2.5 m away from the linear array. The input data were sampled at 8 kHz and segmented into overlapping frames of 40 ms duration (10 ms step size).
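A brief sketch of this data-generation step is given below, assuming a room impulse response rir obtained from an image-method implementation and a NOISEX-92 noise segment already loaded at 8 kHz; mix_at_snr is our helper name, not part of the original setup.

    import numpy as np
    from scipy.signal import fftconvolve

    def mix_at_snr(clean, rir, noise, snr_db):
        """Reverberate clean speech and add noise scaled to the requested SNR, cf. eq. (1)."""
        reverberant = fftconvolve(clean, rir)[:len(clean)]        # h(t) * s(t)
        noise = noise[:len(reverberant)]
        p_sig = np.mean(reverberant ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
        return reverberant + gain * noise                         # x(t) = h(t)*s(t) + n(t)

    # e.g. the 5 dB babble condition: x = mix_at_snr(clean, rir, babble, snr_db=5)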
The performance of the proposed system is compared to the single-microphone LRT VAD (SM-LRT) presented in [12] and to the systems proposed in [10], [11], denoted as EMD+HHT and EMD+SpEnt respectively. The same frame/step sizes have been used for all systems.
Figure 1 depicts how the performance of the proposed system varies as the SNR drops due to AWGN. The graph shows that the performance of the proposed system surpasses all other solutions. Additionally, EMD with spectral entropy performs better than EMD with HHT only for SNRs down to 10 dB. This is because the overall performance of the former system is constrained by its Pf. The single-microphone LRT is the worst performer, illustrating the benefit of applying EMD prior to the likelihood testing.

Fig. 3. Pe performance under different intensities of vehicular noise (Pe (%) versus SNR (dB); curves: EMD MM-LRT, SM-LRT, EMD+HHT, EMD+SpEnt).

TABLE I
PERFORMANCE RESULTS UNDER VARIOUS TYPES OF NOISE (each cell lists Pe% / Pc% / Pf%)

Noise    SNR     EMD MM-LRT             SM-LRT                  EMD+HHT                 EMD+SpEnt
AWGN     20 dB   1.94 / 2.32 / 1.65     7.85 / 6.46 / 9.24      3.01 / 3.91 / 2.12      3.35 / 1.39 / 5.31
AWGN     15 dB   1.85 / 2.77 / 0.93     13.89 / 19.97 / 7.81    4.80 / 6.42 / 3.19      4.44 / 1.01 / 7.88
AWGN     10 dB   2.05 / 2.09 / 2.01     16.50 / 19.80 / 13.20   6.30 / 6.23 / 6.38      6.07 / 7.95 / 4.19
AWGN     5 dB    3.59 / 4.64 / 2.55     19.64 / 18.58 / 20.71   9.53 / 10.56 / 8.51     10.81 / 9.45 / 12.18
AWGN     0 dB    5.61 / 6.62 / 4.61     22.23 / 19.64 / 24.82   11.65 / 16.12 / 7.18    13.30 / 11.29 / 15.32
AWGN     -5 dB   9.77 / 10.75 / 8.78    25.13 / 22.46 / 27.81   16.03 / 19.32 / 12.74   17.87 / 16.72 / 19.03
Babble   20 dB   2.23 / 1.89 / 2.57     7.25 / 6.01 / 8.49      5.87 / 5.32 / 6.42      4.39 / 2.34 / 6.45
Babble   15 dB   3.00 / 3.14 / 2.86     9.75 / 7.38 / 12.12     6.43 / 7.11 / 5.76      6.77 / 5.17 / 8.37
Babble   10 dB   5.26 / 5.52 / 5.00     13.63 / 9.94 / 17.32    8.23 / 9.28 / 7.19      10.79 / 9.78 / 11.81
Babble   5 dB    8.44 / 7.16 / 9.73     18.81 / 14.73 / 22.89   12.36 / 12.79 / 11.93   12.06 / 10.01 / 14.12
Babble   0 dB    14.16 / 9.98 / 18.34   22.74 / 19.41 / 26.07   18.09 / 17.54 / 18.64   17.78 / 18.04 / 17.52
Babble   -5 dB   21.18 / 15.36 / 26.99  26.85 / 23.08 / 30.63   22.09 / 20.52 / 23.66   18.44 / 16.81 / 20.08
Vehicle  20 dB   2.18 / 1.98 / 2.38     6.56 / 6.26 / 6.85      4.24 / 4.51 / 3.97      4.66 / 3.18 / 6.51
Vehicle  15 dB   1.77 / 2.25 / 1.29     8.55 / 9.45 / 7.65      4.92 / 4.03 / 5.82      5.37 / 4.31 / 6.44
Vehicle  10 dB   3.82 / 3.10 / 4.54     14.74 / 11.72 / 17.77   8.64 / 6.37 / 10.91     6.59 / 5.05 / 8.13
Vehicle  5 dB    5.18 / 5.01 / 5.35     18.53 / 19.57 / 17.49   10.10 / 8.79 / 11.42    11.45 / 9.49 / 13.42
Vehicle  0 dB    9.87 / 7.77 / 11.96    22.97 / 21.33 / 24.62   11.78 / 9.44 / 14.13    13.83 / 10.99 / 16.67
Vehicle  -5 dB   11.07 / 9.81 / 12.33   24.69 / 20.05 / 29.34   14.31 / 13.18 / 15.44   15.39 / 12.34 / 18.44

In the case of babble noise (Fig. 2), the conclusions are similar to those for AWGN. The proposed system performs better than the rest in almost all cases. EMD with spectral entropy is slightly better than the EMD+HHT system and, especially for the case of -5 dB, it performs better than the proposed solution. The performance of all systems drops significantly faster with decreasing SNR compared to the case of AWGN.
In the case of car noise (Fig. 3), the performance of the proposed system is again above the rest of the systems. The performance of the EMD+HHT system is better than that of EMD with spectral entropy in almost all cases apart from the case of 10 dB. Table I presents the simulation results in detail.
V. CONCLUSIONS
An efficient VAD based on the Empirical Mode Decomposition (EMD) analysis framework has been presented. By relying on the local characteristic time scale of the data, this method is well suited to analyse and decompose non-stationary signals, such as speech, into a set of so-called intrinsic mode functions (IMF). These functions were used as a substitute for microphone array signals and injected into a multiple-microphone likelihood-ratio-based VAD scheme in order to decide upon speech presence or absence. To minimize mis-detections and enhance the performance of the hypothesis test, a computationally efficient forgetting scheme along with an adaptive threshold have also been employed. Through simulations we have demonstrated that the proposed system remains more robust than a set of related counterparts under intense background noise.
REFERENCES
[1] A. Benyassine, E. Shlomot, H. Su, D. Massaloux, C. Lamblin, and J. Petit, "ITU-T Recommendation G.729 Annex B: a silence compression scheme for use with G.729 optimized for V.70 digital simultaneous voice and data applications," IEEE Communications Magazine, vol. 35, no. 9, pp. 64-73, 1997.
[2] R. Gemello, F. Mana, and R. Mori, "Non-linear estimation of voice activity to improve automatic recognition of noisy speech," in Ninth European Conference on Speech Communication and Technology, 2005.
[3] F. Talantzis and A. Constantinides, "Using information theory to detect voice activity," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009, pp. 4613-4616.

[4] R. Tucker, "Voice activity detection using a periodicity measure," IEE Proceedings I (Communications, Speech and Vision), vol. 139, no. 4, pp. 377-380, 1992.
[5] K. Sakhnov, E. Verteletskaya, and B. Simak, "Dynamical energy-based speech/silence detector for speech enhancement applications," in Proceedings of the World Congress on Engineering, vol. 1, 2009.
[6] J. Chang, N. Kim, and S. Mitra, "Voice activity detection based on multiple statistical models," IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 1965-1976, 2006.
[7] T. Petsatodis, C. Boukis, F. Talantzis, Z.-H. Tan, and R. Prasad, "Convex combination of multiple statistical models with application to VAD," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2314-2327, Nov. 2011.
[8] A. Davis and R. Togneri, "Statistical voice activity detection using low-variance spectrum estimation and an adaptive threshold," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, pp. 412-424, March 2006.
[9] N. Huang, Z. Shen, S. Long, M. Wu, H. Shih, Q. Zheng, N. Yen, C. Tung, and H. Liu, "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proceedings of the Royal Society of London, Series A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1971, pp. 903-995, 1998.
[10] X. Tan, J. Gu, H. Zhao, and Z. Tao, "A noise robust endpoint detection algorithm for whispered speech based on empirical mode decomposition and entropy," in 2010 3rd International Symposium on Intelligent Information Technology and Security Informatics (IITSI), IEEE, 2010, pp. 355-359.
[11] Z. Lu, B. Liu, and L. Shen, "Speech endpoint detection in strong noisy environment based on the Hilbert-Huang transform," in 2009 International Conference on Mechatronics and Automation (ICMA), IEEE, 2009, pp. 4322-4326.
[12] T. Petsatodis, F. Talantzis, C. Boukis, Z. Tan, and R. Prasad, "Multi-sensor voice activity detection based on multiple observation hypothesis testing," in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, 2011, pp. 2633-2636.
[13] A. Varga and H. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.
[14] J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Am., vol. 65, no. 4, pp. 943-950, 1979.
