
Improving Monaural Speaker Identification by Double-Talk Detection

R. Saeidi (1), P. Mowlaee (2), T. Kinnunen (1), Z.-H. Tan (2), M. G. Christensen (3), S. H. Jensen (2), and P. Fränti (1)


(1) Speech and Image Processing Unit (SIPU), School of Computing, University of Eastern Finland
(2) Dept. of Electronic Systems, (3) Dept. of Architecture, Design and Media Technology, Aalborg University, Denmark

Interspeech 2010, Japan, September 2010
Presenter: Zheng-Hua Tan



Presentation Outline

Problem Definition and Background
Proposed System
Double-Talk Detection
Speaker Identification
Performance Evaluation


Monaural Speaker Identification


Problem Definition

Fundamentals
Recognize BOTH of the speakers present in a MIXED audio file.
The novelty of this work is the inclusion of a double-talk detector (DTD) as a pre-processor for a previously proposed speaker identification back-end.

R. Saeidi, P. Mowlaee, T. Kinnunen, Z.-H. Tan, M. G. Christensen, S. H. Jensen and P. Fränti, "Signal-to-signal ratio independent speaker identification for co-channel speech signals," in Proc. 20th International Conference on Pattern Recognition (ICPR 2010), pp. 4565-4568, Istanbul, Turkey, August 2010.


Monaural Speaker Identification


Motivation

There are SOME studies that recognize BOTH of the speakers, but they require at least TWO microphones [1].
There are FEW studies that recognize BOTH of the speakers when only ONE microphone is available [2].
Goal: build a stand-alone speaker identification system as a computationally less intensive alternative to the super-human Iroquois system [2].
Goal: bring frame-level single-talk/double-talk information into monaural speaker identification.
[1] Y. E. Kim, J. M. Walsh, and T. M. Doll, "Comparison of a joint iterative method for multiple speaker identification with sequential blind source separation and speaker identification," in Odyssey 2008: The Speaker and Language Recognition Workshop, Jan. 2008.
[2] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, "Super-human multi-talker speech recognition: A graphical modeling approach," Computer Speech and Language, vol. 24, no. 1, pp. 45-66, Jan. 2010.


System structure

Figure: The block diagram of the proposed system.

R. Saeidi et al.

Monaural Speaker Identication

5 / 13

Double-Talk Detection


Assume that we have K candidate models denoted by Mk (i.e., M0, M1, and M2) for describing the monaural speech signal.
We adopt a maximum a posteriori (MAP) criterion for a multiple-hypothesis test to determine double-talk/single-talk regions in segments of the mixed signal.
Given the mixed signal, select the model with the maximum a posteriori probability.
We apply different policies in speaker identification for mixed and single-talker frames.
M0: none of the speakers is active. M1: one of the speakers is active. M2: both of the speakers are active.
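To make the three-way MAP test concrete, here is a minimal Python sketch of the frame-wise decision. It assumes each candidate model Mk exposes a log_likelihood method and a prior; these names and the feature extraction are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of the MAP multiple-hypothesis test for double-talk detection.
    # The model interface (log_likelihood) and the priors are illustrative assumptions.
    import numpy as np

    def detect_frame_state(x_t, models, log_priors):
        """Pick the hypothesis M0 (silence), M1 (single-talk) or M2 (double-talk)
        with the largest posterior probability for one feature frame x_t."""
        # log p(M_k | x_t) = log p(x_t | M_k) + log P(M_k) + const.
        log_posteriors = [m.log_likelihood(x_t) + lp
                          for m, lp in zip(models, log_priors)]
        return int(np.argmax(log_posteriors))   # returns 0, 1 or 2

    def label_frames(features, models, log_priors):
        """Label every frame of the mixed signal as M0/M1/M2."""
        return np.array([detect_frame_state(x, models, log_priors)
                         for x in features])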

Double-Talk Detection Performance


[Figure: three panels (Mixed signal, Speaker 1, Speaker 2) plotting each waveform over 0-1.8 s (Time in sec), each overlaid with the ground truth and the boundaries estimated by the DTD.]
Figure: Double-talk detection results for a mixture of a male and a female speaker mixed at 3 dB SSR. (Labels are -1 for no speech, 0 for mixed signal, 1 for speaker 1, and 2 for speaker 2.)

Speaker Identification
Frame-Level Likelihood (FLL)

The main idea is to use GMM models trained in the mixed-speech domain.
We use the T0 frames of the input feature stream that are recognized as mixed speech.
We compute the FLL as

    $\sigma_{ig}^{t} = \log[p(x_t \mid \lambda_{ig})] - \log[p(x_t \mid \lambda_{\mathrm{UBM}})]$    (1)

Finding the most probable speaker for each frame, we count the number of winning frames per speaker and normalize it.

Figure: $\lambda_{ig}$ as the model for the $i$-th speaker at SSR level $g$.
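As an illustration of the FLL step, the sketch below assumes scikit-learn-style GMMs (score_samples returning per-frame log-likelihoods) and, as a further assumption, takes the best SSR-dependent model per speaker and frame before counting winners; the authors' MATLAB implementation may differ.

    # Illustrative sketch of frame-level likelihood (FLL) scoring over the T0
    # double-talk frames. The GMM scoring interface is an assumption.
    import numpy as np

    def fll_scores(mixed_frames, speaker_ssr_gmms, ubm):
        """mixed_frames     : (T0, D) features labelled as double-talk
           speaker_ssr_gmms : dict {speaker_id: [GMM at each SSR level g]}
           ubm              : universal background model GMM
           Returns normalized winning-frame counts per speaker."""
        ubm_ll = ubm.score_samples(mixed_frames)             # log p(x_t | UBM)
        # sigma_{ig}^t = log p(x_t | lambda_{ig}) - log p(x_t | UBM),
        # taking the best SSR-dependent model per frame (assumption).
        frame_scores = {
            spk: np.max([gmm.score_samples(mixed_frames) - ubm_ll for gmm in gmms],
                        axis=0)
            for spk, gmms in speaker_ssr_gmms.items()
        }
        speakers = list(frame_scores)
        stacked = np.vstack([frame_scores[s] for s in speakers])   # (S, T0)
        winners = np.argmax(stacked, axis=0)                       # winning speaker per frame
        counts = np.bincount(winners, minlength=len(speakers)).astype(float)
        return dict(zip(speakers, counts / counts.sum()))          # normalized counts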



Speaker Identification
Kullback-Leibler Divergence (KLD)

We use the T0 frames of the input feature stream that are recognized as mixed speech.
We compute the KLD as

    $\mathrm{KLD}_{ig} = \frac{1}{2} \sum_{m=1}^{M} w_m \, (\mu_m^{e} - \mu_m^{ig})^{T} \, \Sigma_m^{-1} \, (\mu_m^{e} - \mu_m^{ig})$    (2)

The KLD scores are averaged over the SSR levels $g$ and then normalized.
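A minimal sketch of Eq. (2), assuming the two MAP-adapted GMMs share the UBM weights $w_m$ and diagonal covariances $\Sigma_m$ (standard for MAP-adapted models); variable names are illustrative.

    # Approximate KL divergence between two MAP-adapted GMMs with shared
    # weights and diagonal covariances, as in Eq. (2). Names are illustrative.
    import numpy as np

    def gmm_kld(mu_test, mu_speaker, weights, diag_covs):
        """mu_test, mu_speaker : (M, D) component means of the adapted GMMs
           weights             : (M,)   shared mixture weights w_m
           diag_covs           : (M, D) shared diagonal covariances Sigma_m"""
        diff = mu_test - mu_speaker                            # (M, D)
        mahalanobis_sq = np.sum(diff * diff / diag_covs, axis=1)
        return 0.5 * np.sum(weights * mahalanobis_sq)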



Speaker Identification
Score Fusion

For the T0 frames of the input feature stream recognized as mixed speech, we form the score per speaker as

    score = 0.5 * KLD + 0.5 * FLL

The T1 (T2) frames recognized as belonging to speaker 1 (2) are passed to the KLD module to find the best match.
With idx denoting the speaker identified from the single-talk frames, a bonus is added to its decision score:

    score[idx] = score[idx] + T1/T   (or T2/T)
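The fusion and the single-talk bonus can be summarized by the sketch below; the function signature and variable names are illustrative assumptions, not the authors' code.

    # Sketch of the score fusion and single-talk bonus described above.
    import numpy as np

    def fuse_scores(kld, fll, idx1, idx2, T1, T2, T):
        """kld, fll   : normalized per-speaker score vectors from the two modules
           idx1, idx2 : speakers identified from the T1 / T2 single-talk frames
           T1, T2, T  : single-talk frame counts and total number of frames"""
        score = 0.5 * np.asarray(kld, dtype=float) + 0.5 * np.asarray(fll, dtype=float)
        # Bonus proportional to the amount of single-talk evidence behind each decision.
        score[idx1] += T1 / T
        score[idx2] += T2 / T
        return score   # the two highest-scoring speakers give the final decision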

Evaluation Corpus
Grid corpus
Number of sentences per talker: 1000
Number of speakers: 34 (18 male and 16 female)
Corpus size: 34,000 utterances
Number of distinct sentences: 2048
File duration: typically 1-2 sec

Figure: The Speech Separation Challenge.



Speaker Identification
Results

Table: Speaker identification error rates (%) for finding both speakers in the top-3 list, for same-gender (SG) and different-gender (DG) mixtures. Yes/No indicates whether the proposed DTD method is included. For the same-talker (ST) scenario both systems make no errors.
SSR        SG: No   SG: Yes   DG: No   DG: Yes   Avg: No   Avg: Yes
-9 dB      7.26     6.70      17.50    13.03     8.00      5.32
-6 dB      3.35     3.35      6.00     5.00      3.00      2.29
-3 dB      0.56     0.56      2.50     2.00      1.00      0.61
 0 dB      1.68     1.68      1.00     2.00      0.83      0.61
 3 dB      2.79     2.23      6.50     5.00      3.00      1.89
 6 dB      6.15     5.59      9.50     10.50     5.00      4.37
Average    3.64     3.35      7.17     6.17      3.47      2.57


Conclusion

Successful ideas from speaker verification are applied to monaural speaker identification.
Mixed speech at different SSRs is used to train the speaker GMMs.
Speaker models are created by MAP adaptation rather than conventional ML training.
Double-talk detection is introduced to enhance the performance of the speaker identification system.

MATLAB code
will be made available on my webpage: cs.joensuu.fi/pages/saeidi
Contact: rahim@cs.joensuu.fi

