thp@es.aau.dk
2
fota@ait.gr
Accenture Interactive
Athens 14564, Greece
christos.boukis@accenture.com
Abstract: Voice Activity Detection (VAD) remains a challenging task given its sensitivity to adverse noise and reverberation
conditions. The problem becomes even more difficult when the
microphones used to detect speech reside far from the speaker. In
this paper, an unsupervised VAD scheme is presented, based on
the Empirical Mode Decomposition (EMD) analysis framework
and a multiple input likelihood ratio test (LRT). The highly
efficient method of EMD relies on the local characteristic time
scales of the data to analyse and decompose non-stationary signals
into a set of so-called intrinsic mode functions (IMFs). These
functions are fed into the multiple input LRT scheme in
order to decide upon speech presence or absence. To minimize
mis-detections and enhance the performance of the hypothesis
test, a computationally efficient forgetting scheme along with an
adaptive threshold are also employed. Simulations, conducted
in several artificial environments, illustrate that significant improvements in performance can be expected from the
proposed scheme when compared to similar VAD systems.
I. INTRODUCTION
Voice Activity Detection (VAD) is a core speech processing
technology with applications in several domains. Integrated in
several telecommunication systems, it is used to reduce the power
consumption of transmitters and bandwidth utilization [1].
VAD is often merged with other speech-processing systems,
like Automatic Speech Recognition and Speaker Identification,
to prevent their operation in the absence of speech [2].
Typically, VAD systems rely on continuous observation of
a specific metric to decide on the content of audio signals.
Such metrics can be the energy levels, zero-crossing rate,
periodicity, linear prediction coding parameters, and mutual
information [1], [3], [4], [5]. Recently introduced statistical
VADs attempt to formulate the problem mathematically, by
employing a Likelihood Ratio Test (LRT) as a decision criterion
on top of the Fourier-processed framed input [6], [7], [8].
The performance of VAD systems depends on various
factors, including the discriminative ability of the classification
criterion employed, the dynamics of the additive noise, speaker
movement, the signal-to-noise ratio (SNR) and reverberation. The task
of VAD becomes more difficult when far-field microphones
are used to capture voice. Furthermore, speech generation is a
non-linear and non-stationary process. Especially during fast
transitions between phonemes and voicing states, it can be
considered highly non-stationary. Thus, its analysis, when employing methods such as Fourier, is conducted under specific
assumptions of linearity and stationarity that can potentially
lead to performance reduction. The stationarity requirement is
not particular to Fourier analysis; it applies to
most of the available data analysis methods.
Towards overcoming such adversities, related research has focused on a powerful analysis framework, namely Empirical Mode Decomposition (EMD) [9]. This highly efficient
method relies on the local characteristic time scales of the data
to analyse and decompose non-stationary signals into a set of
so-called intrinsic mode functions (IMFs).
A VAD system based on EMD analysis has been presented
in [10]. By extracting entropy-based features from the resulting
IMFs, the experiments have shown that the method was superior to the entropy extracted from the original speech, especially
under intensive noise. In [11] the signal was first decomposed
employing EMD, and then the results were processed by the
Hilbert transform (HT) to obtain the instantaneous frequency.
The noise threshold was estimated by analysing the front
of the signal's Hilbert amplitude spectrum. The speech segments
and non-speech segments were distinguished by the threshold
and the whole signal's Hilbert amplitude spectrum, resulting
in very good performance under low SNR.
In this paper, an alternative VAD algorithm based on the
EMD decomposition is considered. In order to decide upon
speech presence or absence, the multiple microphone LRT
VAD presented in [12] is employed, driven by the IMFs
that emerge from the decomposition process. In order to improve
results, an adaptive threshold along with a computationally
efficient smoothing scheme are also used in the system.
The paper is organized as follows. In Section II EMD
analysis is summarized. In Section III the proposed VAD is
described in detail. Section IV presents the experiments performed. Section V concludes the work.
II. EMPIRICAL MODE DECOMPOSITION
Through EMD any complicated data set can be decomposed
into a finite and often small number of intrinsic mode functions
that admit a well-behaved HT. The method is adaptive, and
(2)
(3)
(4)
Then, it is designated as $c_1(t) = h_{1d}(t)$, the first IMF component from the data. To guarantee that the IMF components
retain enough physical sense of both amplitude and frequency
modulations, the sifting process is stopped by limiting the size
of the standard deviation, SD, computed from two consecutive
sifting results.
$$\mathrm{SD} = \sum_{t=0}^{T} \frac{\left| h_{1(d-1)}(t) - h_{1d}(t) \right|^2}{h_{1(d-1)}^2(t)} \qquad (5)$$
A typical value for SD lies between 0.2 and 0.3.
Overall, $c_1$ should contain the finest scale or the shortest period
component of the signal. The separation of $c_1$ from the rest
of the data is given by
$$x(t) - c_1(t) = r_1(t). \qquad (6)$$
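The stopping rule of eq. (5) can be illustrated with a minimal Python sketch. The function names `sifting_sd` and `sifting_converged`, the small `eps` guard against division by zero, and the default limit of 0.3 (the upper end of the typical range quoted above) are assumptions made for illustration, not part of the original method description.

```python
def sifting_sd(h_prev, h_curr, eps=1e-12):
    """Sum over t of |h_prev(t) - h_curr(t)|^2 / h_prev(t)^2, as in eq. (5).

    h_prev and h_curr are two consecutive sifting results (equal-length
    sequences of floats). eps is an assumed guard against zero samples.
    """
    if len(h_prev) != len(h_curr):
        raise ValueError("sifting results must have equal length")
    return sum((p - c) ** 2 / (p * p + eps) for p, c in zip(h_prev, h_curr))


def sifting_converged(h_prev, h_curr, sd_limit=0.3):
    """Return True when SD falls below the chosen limit (assumed 0.3 here)."""
    return sifting_sd(h_prev, h_curr) < sd_limit
```

For instance, `sifting_converged([1.0, -2.0, 4.0], [1.1, -2.0, 3.8])` compares two nearly identical sifting results, so the sifting loop would stop and the current result would be taken as the IMF.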
$$x(t) = \sum_{i=1}^{I} c_i(t) + r_I(t). \qquad (8)$$
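The completeness of the decomposition, i.e. that the IMFs plus the final residue add back to the original signal, can be checked with a short sketch. The helper names `peel_residues` and `reconstruct` are hypothetical, and the toy "IMFs" in the usage note are arbitrary arrays chosen only to exercise the arithmetic; they are not outputs of a real sifting process.

```python
def peel_residues(x, imfs):
    """Subtract each component in turn: r_i(t) = r_{i-1}(t) - c_i(t), with r_0 = x."""
    residues = []
    current = list(x)
    for c in imfs:
        current = [r - ci for r, ci in zip(current, c)]
        residues.append(current)
    return residues


def reconstruct(imfs, final_residue):
    """Sum all components and the final residue to recover x(t)."""
    total = list(final_residue)
    for c in imfs:
        total = [s + ci for s, ci in zip(total, c)]
    return total
```

With `x = [1.0, 2.0, 3.0, 4.0]` and two placeholder components, `reconstruct(imfs, peel_residues(x, imfs)[-1])` returns the original samples, which is the identity the decomposition relies on.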
(7)
(9)
(10)
where $X_i(t) = [X_{0,i}(t), X_{1,i}(t), \ldots, X_{K-1,i}(t)]^T$,
$S_i(t) = [S_{0,i}(t), S_{1,i}(t), \ldots, S_{K-1,i}(t)]^T$ and
$N_i(t) = [N_{0,i}(t), N_{1,i}(t), \ldots, N_{K-1,i}(t)]^T$ are the noisy captured speech, reverberated
speech, and noise frequency components for the $i$th IMF
$c_i(t)$, with $K$ the total number of frequency bins.
Real and imaginary parts of noise and speech frequency
spectrum are assumed to be zero mean Gaussian distributed for
every IMF. The probability densities for the noise and speech
components with k denoting the frequency bin are given by
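$$f_{n,i}(N_{k,i}(t)) = \left(2\pi\sigma_{n,k,i}^2\right)^{-1/2} \exp\left(-\frac{N_{k,i}(t)^2}{2\sigma_{n,k,i}^2}\right) \qquad (11)$$
$$f_{s,i}(S_{k,i}(t)) = \left(2\pi\sigma_{s,k,i}^2\right)^{-1/2} \exp\left(-\frac{S_{k,i}(t)^2}{2\sigma_{s,k,i}^2}\right) \qquad (12)$$
where $\lambda_{n,k,i} = \sigma_{n,k,i}^2$ and $\lambda_{s,k,i} = \sigma_{s,k,i}^2$ are the slowly varying
variances of the Gaussian distributed noise and speech, respectively, estimated by employing eqn. (16) for the $k$th frequency
component of the $i$th IMF. The probability density functions
conditioned on $H_{0,i}$ and $H_{1,i}$ are given by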
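$$p(X_{k,i} \mid H_{0,i}) = \prod_{k=0}^{K-1} \frac{1}{\pi\lambda_{n,k,i}} \exp\left(-\frac{|X_{k,i}|^2}{\lambda_{n,k,i}}\right) \qquad (13)$$
$$p(X_{k,i} \mid H_{1,i}) = \prod_{k=0}^{K-1} \frac{1}{\pi\left[\lambda_{n,k,i} + \lambda_{s,k,i}\right]} \exp\left(-\frac{|X_{k,i}|^2}{\lambda_{n,k,i} + \lambda_{s,k,i}}\right). \qquad (14)$$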
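In the case of a single microphone VAD scheme the likelihood
ratio for the $k$th frequency bin of the $i$th IMF is defined as
$$\Lambda_{k,i} \triangleq \frac{p(X_{k,i} \mid H_{1,i})}{p(X_{k,i} \mid H_{0,i})} = \frac{1}{1 + \xi_{k,i}} \exp\left(\frac{\gamma_{k,i}\,\xi_{k,i}}{1 + \xi_{k,i}}\right) \qquad (15)$$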
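The closed form of eq. (15) can be checked numerically against the ratio of the per-bin Gaussian factors of eqs. (13) and (14). The sketch below assumes the conventional definitions of the a priori SNR, xi = lambda_s / lambda_n, and the a posteriori SNR, gamma = |X|^2 / lambda_n; these definitions are not stated in the surviving text and are taken here as an assumption. The function names are hypothetical.

```python
import math


def bin_likelihood_ratio(x_power, lam_n, lam_s):
    """Ratio of the H1 and H0 complex-Gaussian densities for a single bin.

    x_power is |X|^2; lam_n and lam_s are the noise and speech variances.
    """
    p_h0 = math.exp(-x_power / lam_n) / (math.pi * lam_n)
    p_h1 = math.exp(-x_power / (lam_n + lam_s)) / (math.pi * (lam_n + lam_s))
    return p_h1 / p_h0


def bin_likelihood_ratio_snr(gamma, xi):
    """Closed form of eq. (15) in terms of the assumed SNR quantities."""
    return math.exp(gamma * xi / (1.0 + xi)) / (1.0 + xi)
```

Under the stated assumption the two functions agree: for `x_power=3.0`, `lam_n=1.5`, `lam_s=4.5` the corresponding SNR values are `gamma=2.0` and `xi=3.0`, and both calls return the same ratio up to floating-point rounding.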
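$$\hat{\lambda}_{n,k,i}(t+1) = \alpha_n \hat{\lambda}_{n,k,i}(t) + (1 - \alpha_n)\, E\left[\, |N_{k,i}(t)|^2 \,\middle|\, X_{k,i}(t) \right] \qquad (16)$$
where $\hat{\lambda}_{s,k,i}(t)$, $\hat{\lambda}_{n,k,i}(t)$ are estimates of $\lambda_{s,k,i}(t)$, $\lambda_{n,k,i}(t)$
and $\alpha_n$, $\alpha_s$ are smoothing parameters, both set to 0.99.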
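The recursive variance update can be sketched as a first-order exponential smoother. In the sketch below the conditional expectation term is not computed; it is passed in as `power_estimate`, a hypothetical per-frame input, since the text does not specify how it is obtained. The function names are likewise illustrative, and only the smoothing parameter value of 0.99 comes from the text.

```python
def smooth_variance(prev_estimate, power_estimate, alpha=0.99):
    """One step of the recursive update: alpha * old + (1 - alpha) * new."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * prev_estimate + (1.0 - alpha) * power_estimate


def track_variance(power_estimates, initial, alpha=0.99):
    """Apply the update frame by frame and return the estimate history."""
    estimate = initial
    history = []
    for p in power_estimates:
        estimate = smooth_variance(estimate, p, alpha)
        history.append(estimate)
    return history
```

With a smoothing parameter as close to one as 0.99, each new frame moves the estimate by only 1% of the gap to the new observation, which is what makes the tracked variances slowly varying.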
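The decision is drawn through the geometric mean of the
likelihood ratios for the individual frequencies of every IMF
$$\log\Lambda = \frac{1}{IK}\sum_{i=1}^{I}\sum_{k=0}^{K-1} \log\Lambda_{k,i} = \frac{1}{IK}\sum_{i=1}^{I}\sum_{k=0}^{K-1} \left\{ \gamma_{k,i} - \log(\gamma_{k,i}) - 1 \right\} \underset{H_0}{\overset{H_1}{\gtrless}} \eta \qquad (17)$$
[Fig. 1: plot of P (%) versus SNR (dB)]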
(19)
where $\alpha_{\mathrm{EMD}} = 0.9$ is the smoothing factor and $\tilde{\Lambda}_{\log}(t)$ the
smoothed likelihood.
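A rough sketch of the frame-level decision follows. It averages the per-bin log-likelihood ratios over all IMFs and bins (the logarithm of their geometric mean), smooths the result over time with a factor of 0.9, and compares it with a threshold. The first-order recursive form of the smoother, the fixed `threshold` argument (the adaptive threshold is not modelled) and all function names are assumptions for illustration only.

```python
def mean_log_lr(log_lrs):
    """Average log-likelihood ratio over IMFs and bins.

    log_lrs is a list over IMFs, each a list of per-bin log ratios.
    The arithmetic mean of logs equals the log of the geometric mean.
    """
    values = [v for imf_bins in log_lrs for v in imf_bins]
    return sum(values) / len(values)


def smoothed_decisions(frame_scores, threshold, alpha=0.9):
    """Smooth per-frame scores recursively and flag speech above threshold."""
    smoothed = None
    decisions = []
    for score in frame_scores:
        if smoothed is None:
            smoothed = score  # assumed initialisation with the first frame
        else:
            smoothed = alpha * smoothed + (1.0 - alpha) * score
        decisions.append(smoothed > threshold)
    return decisions
```

The smoothing acts as a forgetting scheme: an isolated low score inside a speech segment does not immediately flip the decision, at the cost of a short delay at speech onsets and offsets.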
[Fig. 2: plot of P (%) versus SNR (dB) for EMD MMLRT, SMLRT, EMD + HHT and EMD + SpEnt]
[Fig. 3: plot of P (%) versus SNR (dB) for EMD MMLRT, SMLRT, EMD + HHT and EMD + SpEnt]
TABLE I
PERFORMANCE RESULTS UNDER VARIOUS TYPES OF NOISE

                    EMD MM-LRT            SM-LRT          EMD + HHT [9]     EMD + SpEnt. [10]
Noise     SNR    Pe %  Pc %  Pf %    Pe %  Pc %  Pf %    Pe %  Pc %  Pf %    Pe %  Pc %  Pf %
AWGN     20dB    1.94  2.32  1.65    7.85  6.46  9.24    3.01  3.91  2.12    3.35  1.39  5.31
         15dB    1.85  2.77  0.93   13.89 19.97  7.81    4.80  6.42  3.19    4.44  1.01  7.88
         10dB    2.05  2.09  2.01   16.50 19.80 13.20    6.30  6.23  6.38    6.07  7.95  4.19
          5dB    3.59  4.64  2.55   19.64 18.58 20.71    9.53 10.56  8.51   10.81  9.45 12.18
          0dB    5.61  6.62  4.61   22.23 19.64 24.82   11.65 16.12  7.18   13.30 11.29 15.32
         -5dB    9.77 10.75  8.78   25.13 22.46 27.81   16.03 19.32 12.74   17.87 16.72 19.03
Babble   20dB    2.23  1.89  2.57    7.25  6.01  8.49    5.87  5.32  6.42    4.39  2.34  6.45
         15dB    3.00  3.14  2.86    9.75  7.38 12.12    6.43  7.11  5.76    6.77  5.17  8.37
         10dB    5.26  5.52  5.00   13.63  9.94 17.32    8.23  9.28  7.19   10.79  9.78 11.81
          5dB    8.44  7.16  9.73   18.81 14.73 22.89   12.36 12.79 11.93   12.06 10.01 14.12
          0dB   14.16  9.98 18.34   22.74 19.41 26.07   18.09 17.54 18.64   17.78 18.04 17.52
         -5dB   21.18 15.36 26.99   26.85 23.08 30.63   22.09 20.52 23.66   18.44 16.81 20.08
Vehicle  20dB    2.18  1.98  2.38    6.56  6.26  6.85    4.24  4.51  3.97    4.66  3.18  6.51
         15dB    1.77  2.25  1.29    8.55  9.45  7.65    4.92  4.03  5.82    5.37  4.31  6.44
         10dB    3.82  3.10  4.54   14.74 11.72 17.77    8.64  6.37 10.91    6.59  5.05  8.13
          5dB    5.18  5.01  5.35   18.53 19.57 17.49   10.10  8.79 11.42   11.45  9.49 13.42
          0dB    9.87  7.77 11.96   22.97 21.33 24.62   11.78  9.44 14.13   13.83 10.99 16.67
         -5dB   11.07  9.81 12.33   24.69 20.05 29.34   14.31 13.18 15.44   15.39 12.34 18.44