
Improving Monaural Speaker Identification by Double-Talk Detection

R. Saeidi (1), P. Mowlaee (2), T. Kinnunen (1), Z.-H. Tan (2), M. G. Christensen (3), S. H. Jensen (2), and P. Fränti (1)


(1) Speech and Image Processing Unit (SIPU), School of Computing, University of Eastern Finland
(2) Dept. of Electronic Systems, (3) Dept. of Architecture, Design and Media Technology, Aalborg University, Denmark

Interspeech 2010, Japan, September 2010
Presenter: Zheng-Hua Tan



Presentation Outline

Problem Definition and Background
Proposed System
Double-Talk Detection
Speaker Identification
Performance Evaluation


Monaural Speaker Identification


Problem Definition

Fundamentals
Recognize BOTH of the speakers present in a MIXED audio file.
The novelty of this work is the inclusion of a double-talk detector (DTD) as a pre-processor for a previously proposed speaker identification back-end.

R. Saeidi, P. Mowlaee, T. Kinnunen, Z.-H. Tan, M. G. Christensen, S. H. Jensen and P. Fränti, "Signal-to-signal ratio independent speaker identification for co-channel speech signals," in Proc. 20th International Conference on Pattern Recognition (ICPR 2010), pp. 4565-4568, Istanbul, Turkey, August 2010.


Monaural Speaker Identification


Motivation

There are SOME studies that recognize BOTH of the speakers, but they require at least TWO microphones [1].
There are FEW studies that recognize BOTH of the speakers when only ONE microphone is available [2].
Goal: build a stand-alone speaker identification system as a computationally less intensive alternative to the super-human Iroquois system [2].
Goal: bring frame-level single-talk/double-talk information into monaural speaker identification.
[1] Y. E. Kim, J. M. Walsh, and T. M. Doll, "Comparison of a joint iterative method for multiple speaker identification with sequential blind source separation and speaker identification," in Odyssey 2008: The Speaker and Language Recognition Workshop, Jan. 2008.
[2] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, "Super-human multi-talker speech recognition: A graphical modeling approach," Computer Speech and Language, vol. 24, no. 1, pp. 45-66, Jan. 2010.


System structure

Figure: The block diagram of the proposed system.

R. Saeidi et al.

Monaural Speaker Identication

5 / 13

Double-Talk Detection


Assume that we have K candidate models denoted by Mk (i.e., M0, M1, and M2) for describing the monaural speech signal.
We adopt a maximum a posteriori (MAP) criterion for a multiple-hypothesis test to determine double-talk/single-talk regions in segments of the mixed signal.
Given the mixed signal, select the model with the maximum a posteriori probability.
We apply different policies in speaker identification for mixed and single-talker frames.
M0: none of the speakers is active. M1: one of the speakers is active. M2: both of the speakers are active.
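To make the three-way MAP test concrete, here is a minimal Python sketch of the frame-wise decision. It assumes each candidate model Mk exposes a log_likelihood method and a prior; these names and the feature extraction are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of the MAP multiple-hypothesis test for double-talk detection.
    # The model interface (log_likelihood) and the priors are illustrative assumptions.
    import numpy as np

    def detect_frame_state(x_t, models, log_priors):
        """Pick the hypothesis M0 (silence), M1 (single-talk) or M2 (double-talk)
        with the largest posterior probability for one feature frame x_t."""
        # log p(M_k | x_t) = log p(x_t | M_k) + log P(M_k) + const.
        log_posteriors = [m.log_likelihood(x_t) + lp
                          for m, lp in zip(models, log_priors)]
        return int(np.argmax(log_posteriors))   # returns 0, 1 or 2

    def label_frames(features, models, log_priors):
        """Label every frame of the mixed signal as M0/M1/M2."""
        return np.array([detect_frame_state(x, models, log_priors)
                         for x in features])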

Double-Talk Detection Performance


[Figure: three panels (Mixed signal, Speaker 1, Speaker 2) plotting each waveform over 0-1.8 s (Time in sec), each overlaid with the ground truth and the boundaries estimated by the DTD.]
Figure: Double-talk detection results for a mixture of a male and a female speaker mixed at 3 dB SSR. (Labels are -1 for no speech, 0 for mixed signal, 1 for speaker 1, and 2 for speaker 2.)

Speaker Identification
Frame-Level Likelihood (FLL)

The main idea is to use GMM models trained in the mixed-speech domain.
We use the T0 frames of the input feature stream that are recognized as mixed speech.
We compute the FLL as

    $\sigma_{ig}^{t} = \log[p(x_t \mid \lambda_{ig})] - \log[p(x_t \mid \lambda_{\mathrm{UBM}})]$    (1)

Finding the most probable speaker for each frame, we count the number of winning frames per speaker and normalize it.

Figure: $\lambda_{ig}$ as the model for the $i$-th speaker at SSR level $g$.
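As an illustration of the FLL step, the sketch below assumes scikit-learn-style GMMs (score_samples returning per-frame log-likelihoods) and, as a further assumption, takes the best SSR-dependent model per speaker and frame before counting winners; the authors' MATLAB implementation may differ.

    # Illustrative sketch of frame-level likelihood (FLL) scoring over the T0
    # double-talk frames. The GMM scoring interface is an assumption.
    import numpy as np

    def fll_scores(mixed_frames, speaker_ssr_gmms, ubm):
        """mixed_frames     : (T0, D) features labelled as double-talk
           speaker_ssr_gmms : dict {speaker_id: [GMM at each SSR level g]}
           ubm              : universal background model GMM
           Returns normalized winning-frame counts per speaker."""
        ubm_ll = ubm.score_samples(mixed_frames)             # log p(x_t | UBM)
        # sigma_{ig}^t = log p(x_t | lambda_{ig}) - log p(x_t | UBM),
        # taking the best SSR-dependent model per frame (assumption).
        frame_scores = {
            spk: np.max([gmm.score_samples(mixed_frames) - ubm_ll for gmm in gmms],
                        axis=0)
            for spk, gmms in speaker_ssr_gmms.items()
        }
        speakers = list(frame_scores)
        stacked = np.vstack([frame_scores[s] for s in speakers])   # (S, T0)
        winners = np.argmax(stacked, axis=0)                       # winning speaker per frame
        counts = np.bincount(winners, minlength=len(speakers)).astype(float)
        return dict(zip(speakers, counts / counts.sum()))          # normalized counts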



Speaker Identification
Kullback-Leibler Divergence (KLD)

We use the T0 frames of the input feature stream that are recognized as mixed speech.
We compute the KLD as

    $\mathrm{KLD}_{ig} = \frac{1}{2} \sum_{m=1}^{M} w_m \, (\mu_m^{e} - \mu_m^{ig})^{T} \, \Sigma_m^{-1} \, (\mu_m^{e} - \mu_m^{ig})$    (2)

The KLD scores are averaged over the SSR levels $g$ and then normalized.
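A minimal sketch of Eq. (2), assuming the two MAP-adapted GMMs share the UBM weights $w_m$ and diagonal covariances $\Sigma_m$ (standard for MAP-adapted models); variable names are illustrative.

    # Approximate KL divergence between two MAP-adapted GMMs with shared
    # weights and diagonal covariances, as in Eq. (2). Names are illustrative.
    import numpy as np

    def gmm_kld(mu_test, mu_speaker, weights, diag_covs):
        """mu_test, mu_speaker : (M, D) component means of the adapted GMMs
           weights             : (M,)   shared mixture weights w_m
           diag_covs           : (M, D) shared diagonal covariances Sigma_m"""
        diff = mu_test - mu_speaker                            # (M, D)
        mahalanobis_sq = np.sum(diff * diff / diag_covs, axis=1)
        return 0.5 * np.sum(weights * mahalanobis_sq)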



Speaker Identification
Score Fusion

For the T0 frames of the input feature stream recognized as mixed speech, we form the score per speaker as

    score = 0.5 * KLD + 0.5 * FLL

The T1 (T2) frames recognized as belonging to speaker 1 (2) are passed to the KLD module to find the best match.
With idx denoting the speaker identified from the single-talk frames, a bonus is added to its decision score:

    score[idx] = score[idx] + T1/T   (or T2/T)
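The fusion and the single-talk bonus can be summarized by the sketch below; the function signature and variable names are illustrative assumptions, not the authors' code.

    # Sketch of the score fusion and single-talk bonus described above.
    import numpy as np

    def fuse_scores(kld, fll, idx1, idx2, T1, T2, T):
        """kld, fll   : normalized per-speaker score vectors from the two modules
           idx1, idx2 : speakers identified from the T1 / T2 single-talk frames
           T1, T2, T  : single-talk frame counts and total number of frames"""
        score = 0.5 * np.asarray(kld, dtype=float) + 0.5 * np.asarray(fll, dtype=float)
        # Bonus proportional to the amount of single-talk evidence behind each decision.
        score[idx1] += T1 / T
        score[idx2] += T2 / T
        return score   # the two highest-scoring speakers give the final decision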

Evaluation Corpus
Grid corpus
Number of sentences per talker: 1000
Number of speakers: 34 (18 male and 16 female)
Corpus size: 34,000 utterances
Number of distinct sentences: 2048
File duration: typically 1-2 sec

Figure: The Speech Separation Challenge.



Speaker Identification
Results

Table: Speaker identification error rates (%) for finding both speakers in the top-3 list, for same-gender (SG) and different-gender (DG) mixtures. Yes/No indicates whether the proposed DTD method is included. For the same-talker (ST) scenario both systems make no errors.
SSR        SG: No   SG: Yes   DG: No   DG: Yes   Avg: No   Avg: Yes
-9 dB      7.26     6.70      17.50    13.03     8.00      5.32
-6 dB      3.35     3.35      6.00     5.00      3.00      2.29
-3 dB      0.56     0.56      2.50     2.00      1.00      0.61
 0 dB      1.68     1.68      1.00     2.00      0.83      0.61
 3 dB      2.79     2.23      6.50     5.00      3.00      1.89
 6 dB      6.15     5.59      9.50     10.50     5.00      4.37
Average    3.64     3.35      7.17     6.17      3.47      2.57


Conclusion

Successful ideas from speaker verification are applied to monaural speaker identification.
Mixed speech at different SSRs is used to train the speaker GMMs.
Speaker models are created by MAP adaptation rather than conventional ML training.
Double-talk detection is introduced to enhance the performance of the speaker identification system.

MATLAB code
will be made available on my webpage: cs.joensuu.fi/pages/saeidi
Contact: rahim@cs.joensuu.fi

