3. Speaker Recognition Experiments

[Figure: DET curves on the NIST SRE2008 tel−tel condition; curves: Baseline, Word model, Phone model.]

3.1. Experimental set-up

3.1.1. Task definition

Speaker verification is assessed in one sub-set of the short2-
[Figure: DET curves on the NIST SRE2008 tel−tel condition; Miss probability (in %) versus False Alarm probability (in %); curves: Baseline, NAP16, NAP32, NAP50.]

Figure 3: DET curves of the baseline TN-SVM system compared to the application of NAP variability compensation with different dimensionalities.

System      minDCF (x100)   EER (%)
NAP 16      6.26            16.26
NAP 32      6.11            16.23
NAP 50      6.20            16.17
Baseline    6.59            17.33

Table 2: Minimum Detection Cost (x100) and EER of the baseline and NAP-compensated systems.

mensionalities would not provide additional benefits.
This generalized improvement is not as significant as one could expect. This is very likely due to the fact that cross-channel variability has less influence in this test condition (both training and testing telephone conversations) than in other conditions where room microphones are used together with telephone speech. Thus, it can be said that a moderate degree of channel compensation is achieved (fixed telephone, cellular, etc.), in addition to speaker session variability compensation. In the case of NAP with 32 dimensions, relative reductions of 7.3% in minDCF and 6.4% in EER are achieved with respect to the baseline system.

4. Conclusions

Recent advances in speaker recognition comprise the use of features derived from speech recognition adaptation techniques such as MLLR. A novel approach proposed by the authors, named Transformation Network (TN) features with SVM modelling, allows the extraction of meaningful features for speaker recognition from the adaptation techniques used in connectionist ANN/HMM speech recognition. Previous encouraging results compared to two state-of-the-art baseline detectors motivated the work in this paper.

On the one hand, the importance of automatic speech recognition for phonetic alignment was evaluated, and a considerable performance degradation caused by low-quality automatic transcriptions was observed. However, it was also shown that remarkable speaker recognition performance can still be achieved with a poor automatic transcriber or even with a simple phonetic decoder. Our future work plans comprise the use of the phonetic decoder approach to obtain a speaker-dependent mapping for any phonetic network independently of the language – both of the speech utterance and of the phone classifier. Thus, similarly to phonotactic approaches applied to the language recognition task, TN features for every single language-dependent phonetic decoder may be obtained, resulting in an extended TN vector.

On the other hand, the TN-SVM speaker recognition system benefits from the application of NAP session variability compensation. In the future, the TN features with session variability compensation should be evaluated in a speaker recognition task with stronger cross-channel variability.

5. Acknowledgements

This work was partially supported by FCT (INESC-ID multiannual funding) through the PIDDAC Program funds, and also by the I-DASH European project (SIP-2007-TP-131703).

6. References

[1] Reynolds, D., Quatieri, T. and Dunn, R., “Speaker verification using adapted Gaussian mixture models”, Digital Signal Processing, vol. 10, pp. 19-41, 2000.

[2] Campbell, W. M., Campbell, J. R., Reynolds, D. A., Singer, E. and Torres-Carrasquillo, P. A., “Support vector machines for speaker and language recognition”, Computer Speech and Language, vol. 20, pp. 210-229, 2006.

[3] Ferrer, L., Shriberg, E., Kajarekar, S., Stolcke, A., Sönmez, K., Venkataraman, A. and Bratt, H., “The contribution of cepstral and stylistic features to SRI's 2005 NIST speaker recognition evaluation system”, in Proc. of ICASSP 2006, vol. 1, pp. 101-104, Toulouse, 2006.

[4] Stolcke, A., Ferrer, L., Kajarekar, S., Shriberg, E. and Venkataraman, A., “MLLR transforms as features in speaker recognition”, in Proc. of Eurospeech 2005, pp. 2425-2428, Lisbon, 2005.

[5] Abad, A. and Luque, J., “Connectionist Transformation Network Features for Speaker Recognition”, in Proc. of Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, 2010.

[6] Morgan, N. and Bourlard, H., “An introduction to hybrid HMM/connectionist continuous speech recognition”, IEEE Signal Processing Magazine, vol. 12 (3), pp. 25-42, 1995.

[7] Abrash, V., Franco, H., Sankar, A. and Cohen, M., “Connectionist Speaker Normalization and Adaptation”, in Proc. of Eurospeech 1995, pp. 2183-2186, Madrid, 1995.

[8] “The NIST year 2008 speaker recognition evaluation plan”, http://www.nist.gov/speech/tests/spk/2008/, 2008.

[9] Solomonoff, A., Campbell, W. M. and Boardman, I., “Advances in channel compensation for SVM speaker recognition”, in Proc. of ICASSP 2005, pp. 629-632, Philadelphia, 2005.

[10] Abad, A. and Neto, J., “Incorporating acoustical modeling of phone transitions in an hybrid ANN/HMM speech recognizer”, in Proc. of Interspeech 2008, pp. 2394-2397, Brisbane, 2008.

[11] Pellegrini, T. and Trancoso, I., “Error detection in automatic transcriptions using Hidden Markov Models”, in Proc. of Language and Technology Conference, 2009.

[12] Brummer, N., “FoCal: Tools for Fusion and Calibration of automatic speaker detection systems”, http://www.dsp.sun.ac.za/~nbrummer/focal/.
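The NAP compensation evaluated above projects the dominant nuisance (session) directions out of the SVM feature space, as in [9]. The following is only a minimal illustrative sketch of that idea, not the authors' implementation: the function name, the toy data, and the use of within-speaker scatter to estimate the nuisance subspace are all assumptions made for the example.

```python
import numpy as np

def train_nap_projection(feats, speaker_labels, k):
    """Estimate a NAP projection P = I - U U^T removing the top-k
    nuisance directions (illustrative sketch, not the paper's code).

    feats: (N, D) matrix of per-utterance feature vectors
    speaker_labels: speaker identity of each row; within-speaker
                    scatter across sessions is treated as the nuisance
    k: number of nuisance dimensions to remove (e.g. 16, 32, 50)
    """
    # Subtract each speaker's mean so only within-speaker (session)
    # variability remains.
    centered = np.empty_like(feats)
    for spk in set(speaker_labels):
        idx = [i for i, s in enumerate(speaker_labels) if s == spk]
        centered[idx] = feats[idx] - feats[idx].mean(axis=0)
    # Principal directions of the remaining session variability.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    u = vt[:k].T                      # (D, k) orthonormal nuisance basis
    return np.eye(feats.shape[1]) - u @ u.T

# Toy usage: project nuisance directions out of supervectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
labels = [i % 4 for i in range(20)]   # 4 "speakers", 5 sessions each
P = train_nap_projection(X, labels, k=2)
X_comp = X @ P.T                      # compensated features
```

Since U has orthonormal columns, P is a symmetric idempotent projector of rank D - k; the compensated vectors X_comp would then be used for SVM training and scoring.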
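The relative improvements quoted for NAP with 32 dimensions can be cross-checked directly against the Table 2 values; a minimal sketch:

```python
# Relative reductions of NAP 32 versus the baseline, computed from the
# rounded minDCF (x100) and EER (%) values reported in Table 2.
baseline_mindcf, baseline_eer = 6.59, 17.33
nap32_mindcf, nap32_eer = 6.11, 16.23

rel_mindcf = 100.0 * (baseline_mindcf - nap32_mindcf) / baseline_mindcf
rel_eer = 100.0 * (baseline_eer - nap32_eer) / baseline_eer

print(f"minDCF relative reduction: {rel_mindcf:.1f}%")  # 7.3%
print(f"EER relative reduction:    {rel_eer:.1f}%")     # 6.3%
```

With the rounded table values the minDCF figure matches the quoted 7.3%, while the EER figure comes out near 6.3%; the 6.4% quoted in the text presumably derives from unrounded scores.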