I. INTRODUCTION
Research into speech source localization has received
much attention for cyberworld applications, including automatic camera steering, online video surveillance and
speaker tracking. One widely adopted approach
to speech source localization is the generalized cross-correlation (GCC) based time-difference-of-arrival (TDOA)
estimation algorithm [1]. This algorithm computes the inter-channel delays by locating the maximum of the weighted cross-correlation between each pair of received signals. While
many different prefilters can be applied, the heuristic-based
phase transform (PHAT) prefilter has been found to perform
very well under practical conditions [2].
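The delay search described above can be sketched in a few lines. This is a minimal illustration, assuming two equal-rate channels and an integer-sample delay; the function name `gcc_phat` and its parameters are chosen here for illustration and are not taken from [1] or [2].

```python
import numpy as np

def gcc_phat(x1, x2, fs=1.0):
    """Estimate the delay of x2 relative to x1 (in seconds) using GCC-PHAT."""
    n = len(x1) + len(x2)                # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)             # cross-power spectrum
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep the phase only
    cc = np.fft.irfft(cross, n=n)        # weighted cross-correlation
    max_shift = n // 2
    # Reorder so that lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = np.argmax(cc) - max_shift      # lag of the correlation peak, in samples
    return lag / fs

# Usage: the second channel is the first one delayed by 7 samples.
rng = np.random.default_rng(0)
source = rng.standard_normal(1024)
delayed = np.concatenate((np.zeros(7), source[:-7]))
delay = gcc_phat(source, delayed)
```

Because the PHAT weighting discards the magnitude of the cross-power spectrum, every frequency bin contributes equally to the peak. Bins dominated by noise therefore contribute as much as bins dominated by speech, which is one way to see why the estimate is fragile at low SNR.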
As reported in [2], the PHAT prefilter is optimal in the
maximum likelihood (ML) sense in the presence of reverberation. However, this prefilter is not robust to low signal-to-noise ratio (SNR) conditions and, as a result, the performance
of direction-of-arrival (DOA) estimation algorithms degrades
as the SNR decreases. Figure 1 shows an illustrative example
of this degradation where the mean and standard deviation
This work is supported by the Singapore National Research Foundation Interactive Digital Media R&D Program, under research grant
NRF2008IDM-IDM004-010.
Figure 3. (a) Horizontal axis: frequency (cycles/sample). (b) Subbands 1 to 6 versus time (samples) [after [5]].
$$y = w + n, \tag{1}$$
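[Figure: panels (a) and (b), labelled "Joint Histogram" and "Proposed pdf", with axes "Parent" and "Child".]

$$\hat{\sigma}_y^2 = \frac{1}{M} \sum_{y_k(j) \in B(k)} y_k^2(j), \tag{8}$$

$$\hat{\sigma}_n^2 = \frac{\mathrm{median}(|y_k(1)|)}{0.6745}, \quad y_k(1) \in \text{subband } H_1. \tag{9}$$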
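As a rough illustration of the local estimate in (8), the sketch below averages the squared coefficients over a neighbourhood of M coefficients centred on each position. The sliding one-dimensional window, the odd window length and the reflective edge handling are assumptions made here for illustration; the neighbourhood B(k) used in the paper may be defined differently.

```python
import numpy as np

def local_variance(y, window=7):
    """Average of y^2 over a sliding window of `window` coefficients (odd length)."""
    y = np.asarray(y, dtype=float)
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centred")
    kernel = np.ones(window) / window              # the 1/M weights
    # Reflect the edges so the output has the same length as the input.
    padded = np.pad(y ** 2, window // 2, mode="reflect")
    return np.convolve(padded, kernel, mode="valid")

# Usage: a single coefficient of amplitude 3 spreads its energy over the window.
impulse = np.zeros(9)
impulse[4] = 3.0
var_impulse = local_variance(impulse, window=3)
```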
We note that direct application of (9) is not suitable
for our DOA application. Simulations using (9) exhibit a
degradation in DOA performance, and the bearing errors
are sensitive to the noise variance estimate. This is because the
energy of the speech spectrum varies significantly across
different scales. A poor noise estimate can therefore result
in an inappropriate threshold $T$, which in turn can introduce
additional unwanted high-frequency noise components.
In view of the above, we should consider the degree of
shrinkage for the wavelets of the speech signals and propose
that the new estimator $\hat{\sigma}_n^2$ be given as

$$\hat{\sigma}_n^2 = \frac{\mathrm{median}(|y_k(1)|)}{c}, \quad y_k(1) \in \text{subband } H_1. \tag{10}$$

The performance of the DOA estimation algorithm is therefore dependent on the choice of $c$.
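The median-based estimator with a tunable factor c can be sketched as follows. The Haar transform used to obtain finest-scale detail coefficients is only a stand-in for subband H1, since the wavelet used in the paper is not stated in this section, and the function names are hypothetical. The code returns the ratio median(|y|)/c exactly as it is written in the text; for zero-mean Gaussian noise, the classical choice c = 0.6745 makes this ratio approximate the noise standard deviation.

```python
import numpy as np

def haar_finest_detail(x):
    """Finest-scale Haar detail coefficients (an assumed stand-in for subband H1)."""
    x = np.asarray(x, dtype=float)
    x = x[: len(x) // 2 * 2]             # drop a trailing sample if the length is odd
    return (x[0::2] - x[1::2]) / np.sqrt(2.0)

def noise_estimate(detail, c=0.6745):
    """Return median(|detail|) / c; a smaller c gives a larger estimate."""
    return np.median(np.abs(detail)) / c

# Usage: white Gaussian noise with standard deviation 2.
rng = np.random.default_rng(0)
noise = 2.0 * rng.standard_normal(100_000)
h1 = haar_finest_detail(noise)
classical = noise_estimate(h1)           # close to 2 for c = 0.6745
with_c_one = noise_estimate(h1, c=1.0)   # smaller estimate, hence milder shrinkage
```

Raising c lowers the noise estimate and therefore the threshold derived from it, so less of the signal is shrunk; this is the trade-off that the choice of c controls.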
C. Factor c selection
We determine a suitable value of $c$ that gives rise to good
DOA performance. This can be achieved empirically by
studying how the bearing error varies with $c$ across different speech signals under
different SNR conditions. We first perform denoising using (10), (8) and (5) for 30 speech signals extracted from the
NOIZEUS database [8]. The DOA of the denoised speech
is subsequently estimated using GCC-PHAT. Figures 5(a)
and (b) show the variation of the bearing error with $c$ for
SNR = 0 and 5 dB, respectively. As can be seen,
the bearing error first reduces with $c$, after which it
increases modestly. Accordingly, $c = 1$ is a good choice, i.e.,

$$\hat{\sigma}_n^2 = \frac{\mathrm{median}(|y_k(1)|)}{1}, \quad y_k(1) \in \text{subband } H_1. \tag{11}$$
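The empirical selection procedure can be organised as a simple sweep over candidate factors. The sketch below is only a harness: `estimate_bearing_deg` is a hypothetical callable standing in for the full chain of denoising followed by GCC-PHAT, and the toy estimator in the usage example is fabricated purely to exercise the code, not to reproduce the reported results.

```python
import numpy as np

def bearing_error_stats(c_values, signals, true_bearing_deg, estimate_bearing_deg):
    """Mean and standard deviation of the absolute bearing error for each factor c."""
    stats = {}
    for c in c_values:
        errors = np.array([
            abs(estimate_bearing_deg(signal, c) - true_bearing_deg)
            for signal in signals
        ])
        stats[c] = (errors.mean(), errors.std())
    return stats

def best_factor(stats):
    """Factor with the lowest mean bearing error."""
    return min(stats, key=lambda c: stats[c][0])

# Usage with a toy estimator whose error is smallest at c = 1 (illustrative only).
toy_signals = [0.0, 1.0, -1.0]           # each "signal" is just a bearing offset here
toy_estimator = lambda signal, c: 60.0 + 20.0 * abs(c - 1.0) + signal
stats = bearing_error_stats([0.5, 1.0, 2.0], toy_signals, 60.0, toy_estimator)
chosen = best_factor(stats)
```

Keeping both the mean and the standard deviation matters here, since a factor with a low mean error can still be a poor choice if its errors vary widely from signal to signal.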
b = (b
where $\sigma_y^2$ is the variance of the noisy wavelets. If one
assumes that $Y_k(j)$ has a Gaussian distribution, $\sigma_y^2$ for the
be seen, a good choice for $c(1)$ that gives rise to good DOA
performance for the GCC-PHAT is $c(1) = 0.3$
across the SNRs considered. In addition, we note that, for
$c(1) = 0.7$, a low mean bearing error can be achieved, but
its standard deviation is somewhat higher than in the case
$c(1) = 0.3$. We therefore conclude that $c(1) = 0.3$
and $c(j) = 1$, $j = 2, \ldots, J$, are good choices for DOA
estimation.
Figure 5. Variation of the mean bearing error (degrees) with $c$ for (a) SNR = 0 dB and (b) SNR = 5 dB.
Figure 6. Variation of (a) the mean and (b) the standard deviation of the bearing error (degrees) with SNR (dB) for different factors $c(1)$ (curves for $c(1)$ = 0.3, 0.5 and 0.7).
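$$
\begin{aligned}
r_y(j) &= \frac{\sigma_y(j)}{\sum_{j=1}^{J} \sigma_y(j)} = \frac{\sigma_w(j) + \sigma_n(j)}{\sum_{j=1}^{J} \sigma_y(j)} \\
&= \frac{r_w(j)\left(\sum_{j=1}^{J} \sigma_y(j) - \sum_{j=1}^{J} \sigma_n(j)\right)}{\sum_{j=1}^{J} \sigma_y(j)} + \frac{r_n(j) \sum_{j=1}^{J} \sigma_n(j)}{\sum_{j=1}^{J} \sigma_y(j)}, \qquad (13) \\
&= (1 - \alpha)\, r_w(j) + \alpha\, r_n(j), \qquad (14)
\end{aligned}
$$

where

$$\alpha = \frac{\sum_{j=1}^{J} \sigma_n(j)}{\sum_{j=1}^{J} \sigma_y(j)}. \tag{15}$$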
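The decomposition in (13) to (15) can be checked numerically. The sketch assumes that the per-scale quantity of the noisy signal is the sum of the clean-speech and noise quantities, and that each ratio r is normalised by the sum of its own quantity over all J scales. The names `scale_ratio`, `mixture_check` and `alpha` are chosen here for illustration, and the numbers are arbitrary.

```python
import numpy as np

def scale_ratio(values):
    """Share of each scale j in the total over all scales."""
    values = np.asarray(values, dtype=float)
    return values / values.sum()

def mixture_check(w, n):
    """Return alpha, r_y and the blend (1 - alpha) * r_w + alpha * r_n."""
    w = np.asarray(w, dtype=float)
    n = np.asarray(n, dtype=float)
    y = w + n                             # assumed additivity per scale
    alpha = n.sum() / y.sum()             # noise share of the total
    r_y = scale_ratio(y)
    blend = (1.0 - alpha) * scale_ratio(w) + alpha * scale_ratio(n)
    return alpha, r_y, blend

# Usage: speech concentrated in the first scales, noise spread evenly.
alpha, r_y, blend = mixture_check([4.0, 2.0, 1.0, 0.5], [0.5, 0.5, 0.5, 0.5])
```

The two sides agree for any non-negative inputs, which shows that the noisy ratio is a weighted blend of the speech and noise ratios, and that it moves towards the noise ratio as the noise share grows.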
[Figure: bearing error (degrees) versus SNR (dB), panels (a) and (b), comparing no denoising, Berouti's approach [9], Martin's approach [10] and wavelet-based denoising.]
V. CONCLUSION
We presented a novel wavelet-based speech denoising
algorithm for achieving high DOA performance for speech
signals. We estimate the local noise variance, which can improve DOA performance further. Simulation results showed
that our proposed method outperforms the spectral subtraction
technique under low-SNR conditions, where the original PHAT algorithm
is not robust.
REFERENCES
[1] C. Knapp and G. Carter, "The generalized correlation method
for estimation of time delay," IEEE Trans. Acoust., Speech
and Signal Process., vol. 24, no. 4, pp. 320–327, Aug. 1976.
[2] C. Zhang, D. Florencio, and Z. Y. Zhang, "Why does PHAT
work well in low noise, reverberative environments?" IEEE
Int'l Conf. Acoust., Speech and Signal Process., pp. 2565–2568, Mar.-Apr. 2008.
[3] M. Miller and N. Kingsbury, "Image denoising using derotated complex wavelet coefficients," IEEE Trans. Image Process., vol. 17, no. 9, pp. 1500–1511, Nov. 2008.