Anis Ben Aicha and Sofia Ben Jebara
Unité de recherche TECHTRA, Ecole Supérieure des Communications de Tunis, 2083 Cité El-Ghazala/Ariana, TUNISIA
anis ben aicha@yahoo.fr,sofia.benjebara@supcom.rnu.tn
Abstract
This paper deals with the importance of identifying and selecting the adequate speech denoising technique for a considered application. The selection depends on the desired quality. Three types of degradation are considered: the overall degradation, the speech distortion and the residual background noise. Several popular speech enhancement techniques, with and without perceptual tools, are considered. Quality is evaluated with several criteria that are well correlated with subjective tests and ITU-T recommendations. The analysis carried out in this paper shows that, in terms of overall quality and residual noise, perceptual techniques lead to the best performance. In terms of speech distortion, the best performance is obtained with spectral subtraction and statistical techniques.
Key words: speech enhancement, evaluation criteria, different degradations.
1. Introduction
The problem of enhancing speech degraded by background noise is a research topic that has received a great deal of attention over the past few decades. Many techniques have been presented since the early years (see for example [1] for a tour). The techniques can be broadly divided into several categories according to their design principle and the tools they use. Examples include spectral subtraction based techniques, statistical model based techniques, subspace techniques, perceptual based techniques, ... For any speech communication application where speech denoising is indispensable (hands-free communications, voice over IP, hearing aids, answering machines, teleconferencing systems, car and mobile phones, cockpits, noisy manufacturing environments, ...), careful preliminary thought must be given to an unavoidable question: which speech denoising technique should be used? To answer it, we should also think about which kind of speech quality we want to obtain. Classically, users look for a good overall listening quality, which can be measured using many objective speech quality measures. These criteria evaluate the overall quality of denoised speech by a single score which embeds all kinds of degradation (speech distortion, residual background noise, musical noise, clipping, ...). Nowadays, novel tendencies in quality evaluation are oriented towards a more precise judgement of the type of perceived degradation. In fact, during listening tests, some applications tolerate a lowered background noise level, while others tolerate slight distortions caused by the denoising process. In this paper, we aim at comparing several speech denoising approaches. The purpose is to find the best denoising technique which reduces one selected kind of degradation. To achieve this objective, several overall measures, and others that separate the degradations, are used. We focus on objective criteria recently introduced to mimic listening tests, which are naturally able to identify the kind of degradation.
This paper is organized as follows. Section 2 defines the different kinds of degradation affecting denoised speech. In section 3, we give a brief overview of the tested denoising techniques. Section 4 and section 5 are reserved for the presentation of the criteria used for the overall quality, speech distortion and residual noise evaluation. In section 6, we discuss experimental results.
The noisy speech $y(t)$ is modeled as
$$y(t) = s(t) + n(t), \tag{1}$$
where $s(t)$ is the clean speech signal and $n(t)$ is the noise signal; they are assumed to be uncorrelated. Due to the short-time stationarity property of speech, the processing is done on a frame-by-frame basis. The Short Time Fourier Transform (STFT) is used and the previous model is rewritten as
$$Y(m,k) = S(m,k) + N(m,k), \tag{2}$$
where $m$ (resp. $k$) denotes the frame index (resp. the frequency index). Speech denoising aims at finding an estimate of the short-time spectral amplitude of the original speech, denoted here $|\hat{S}(m,k)|$. Classically, it is obtained using a denoising filter $H(m,k)$:
$$|\hat{S}(m,k)| = H(m,k)\,|Y(m,k)|. \tag{3}$$
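As an illustration of Eq. (3), the following is a minimal Python sketch of frame-by-frame amplitude filtering. It is not one of the techniques evaluated in this paper: the Hann window, the frame and hop sizes, the Wiener-type gain rule `wiener_type_gain`, the sinusoid standing in for speech and the oracle noise PSD are all assumptions chosen only to make the example self-contained.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform with a Hann window; rows are frames m, columns are bins k."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[m * hop: m * hop + frame_len] * window
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def wiener_type_gain(noisy_psd, noise_psd, floor=1e-3):
    """Hypothetical gain rule: H = max(1 - noise/noisy, floor), so floor <= H <= 1."""
    return np.maximum(1.0 - noise_psd / np.maximum(noisy_psd, 1e-12), floor)

def denoise_amplitude(Y, H):
    """Eq. (3): estimated clean amplitude = H(m,k) * |Y(m,k)|."""
    return H * np.abs(Y)

# Toy signals (assumptions): a sinusoid in place of speech, white Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0                 # 1 s at 8 kHz
s = np.sin(2 * np.pi * 440 * t)
n = 0.3 * rng.standard_normal(t.size)

Y = stft(s + n)                              # Eq. (2): Y = S + N
noise_psd = np.mean(np.abs(stft(n)) ** 2, axis=0)   # oracle noise PSD, known only in simulation
H = wiener_type_gain(np.abs(Y) ** 2, noise_psd)
S_hat_amp = denoise_amplitude(Y, H)
```

Because the gain never exceeds one, every estimated amplitude is at most the noisy amplitude. In practice the noise PSD has to be estimated from the noisy signal, which this sketch sidesteps.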
Various considerations are taken into account to design $H(m,k)$: for example, the relative amount of noise measured in terms of signal-to-noise ratio, the estimated noise spectral amplitude, only the audible part of the noise, masking of the audible residual noise, ... Once the filter design criterion is defined, the filter can be expressed and determined. Each filter defined in this way inevitably introduces different kinds of degradation, which we explain in the next subsection.
2.2. Degradation separation
Generally, speech and noise are assumed uncorrelated. Thus, the Power Spectrum Density (PSD) of the error between clean and denoised speech, $\Gamma_\varepsilon(m,k)$, is given by
$$\Gamma_\varepsilon(m,k) = [H(m,k) - 1]^2\,\Gamma_S(m,k) + H(m,k)^2\,\Gamma_N(m,k), \tag{4}$$
where $\Gamma_S(m,k)$ (resp. $\Gamma_N(m,k)$) denotes the speech PSD (resp. the noise PSD). The quantity $\Gamma_\varepsilon(m,k)$ reflects the overall degradation. Its first term expresses the attenuation of clean speech frequency components. In fact, since $H(m,k)$ is used to reduce the amount of noise in the observed signal, its amplitude is less than one. Consequently, this degradation is perceived as clean speech distortion. The second term of Eq. 4 expresses the residual noise. It can be perceived as background noise or as musical noise. Since it is additive, it can be regarded as an accentuation of speech frequency components.
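The two terms of Eq. (4) can be made concrete with a small numerical sketch. If the gain is assumed to act on the complex spectrum, the error is (H - 1)S + HN, and the uncorrelatedness of speech and noise removes the cross term. The function name `error_psd_terms` and the PSD values below are hypothetical, chosen only to show the trade-off at a single time-frequency bin.

```python
import numpy as np

def error_psd_terms(H, psd_speech, psd_noise):
    """Split the error PSD of Eq. (4) into speech distortion and residual noise."""
    H = np.asarray(H, dtype=float)
    distortion = (H - 1.0) ** 2 * psd_speech   # first term: attenuated clean speech
    residual = H ** 2 * psd_noise              # second term: noise left after filtering
    return distortion, residual

# Illustrative values (assumptions): speech PSD 4.0, noise PSD 1.0 at one bin.
gains = np.array([0.0, 0.5, 1.0])
distortion, residual = error_psd_terms(gains, psd_speech=4.0, psd_noise=1.0)
total = distortion + residual                  # overall degradation of Eq. (4)
# gains = 0.0 -> distortion 4.0, residual 0.0  (noise and speech both removed)
# gains = 0.5 -> distortion 1.0, residual 0.25
# gains = 1.0 -> distortion 0.0, residual 1.0  (noisy speech left untouched)
```

The case H = 1 corresponds to the unprocessed noisy signal: no speech distortion, but all of the noise remains. Lowering the gain trades residual noise for speech distortion.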
... correlated with MOS, are proposed to estimate speech quality at low cost. In this paper, we used the Modified Bark Spectrum Distortion (MBSD) [10], the Perceptual Evaluation of Speech Quality (PESQ) [8], the recent composite criterion (COVL) [9] and the Perceptual Signal to Audible Noise and Distortion Ratio (PSANDR) [12]. The correlation coefficients between these criteria and MOS are presented in Tab. 1 [13].
Table 1: Correlation coefficients between overall objective criteria and the subjective score MOS.
       MBSD   PESQ   COVL   PSANDR
MOS    0.51   0.67   0.68   0.78
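The text does not state which correlation coefficient is reported. Assuming the usual Pearson coefficient, a value like those in Table 1 could be computed as in the sketch below. The scores in the example are made up for illustration and are not data from this paper.

```python
import numpy as np

def pearson_correlation(objective_scores, subjective_scores):
    """Pearson correlation coefficient between two equally long score lists."""
    x = np.asarray(objective_scores, dtype=float)
    y = np.asarray(subjective_scores, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))

# Hypothetical scores for five processed utterances (NOT results from the paper).
objective = [1.2, 1.9, 2.4, 2.8, 3.5]   # e.g. outputs of an objective criterion
mos = [1.5, 2.0, 2.1, 3.0, 3.4]         # e.g. mean opinion scores from listeners
rho = pearson_correlation(objective, mos)
```

A coefficient close to 1 means the objective criterion ranks and scales the utterances much as listeners do.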
In our simulations, we used a clean speech sentence extracted from the TIMIT database ("She had your dark suit in greasy wash water all year"), sampled at 8 kHz. It was artificially corrupted with white Gaussian noise and babble noise at different values of Signal to Noise Ratio (denoted SNR). The denoising techniques mentioned above are applied, and quality is evaluated using the overall criteria cited above. The results are summarized in Tab. 3.
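The corruption step can be sketched as follows. This is one common way to mix noise at a prescribed SNR, not necessarily the exact procedure used here. The sketch assumes the SNR is defined globally over the whole utterance, and the function name `add_noise_at_snr` and the toy signals are illustrative.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) equals snr_db, then add it.

    Returns the noisy signal and the scaled noise. Powers are averaged over the
    whole signal (a global SNR definition, assumed here).
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if noise.size < clean.size:
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[: clean.size]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return clean + scaled_noise, scaled_noise

# Toy example (assumptions): a sinusoid in place of the TIMIT sentence, 8 kHz sampling.
rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)
white = rng.standard_normal(8000)
noisy, scaled = add_noise_at_snr(clean, white, snr_db=10.0)
measured_snr = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(scaled ** 2))
```

The same function applies to babble noise by passing a recorded babble segment as `noise`.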
Table 2: Correlation coefficients between separate degradation measures and their related subjective scores.
        CSIG   PSADR   CBAK   PSANR
CSIG    0.48   0.44
CBAK                   0.72   0.75
6. Results
In terms of performance, spectral subtraction is computationally efficient and has a simple mechanism to control the trade-off between speech distortion and residual noise, but it suffers from a notorious artifact known as musical noise. The statistical model based techniques reduce musical noise but do not eliminate it completely. They have a moderate computational load, but have no mechanism to control the trade-off between speech distortion and residual noise. The perceptual based techniques have greater complexity, but they considerably improve the intelligibility of speech. Their trade-off between speech distortion and residual noise is imposed, and no mechanism is provided to control it.
Experimental results of the objective evaluation for the different kinds of degradation (overall quality, speech distortion and residual noise) are summarized in Table 3, Table 4 and Table 5, respectively. These tables permit the following interpretations.
Overall quality evaluation
The best results are obtained with techniques which use perceptual tools, whatever the nature of the noise (white or babble). This can be explained by the fact that these techniques make a trade-off between speech distortion and residual noise, which improves the overall quality of the denoised speech. In the case of white noise, the PPP leads to better performance, especially at low SNR. This can be explained by the fact that, in this case, musical tones are numerous and the PPP succeeds in detecting and eliminating them. At high SNR levels, however, the perceptual techniques perform almost equally. Indeed, the performance of these techniques depends greatly on the accuracy of the masking threshold (MT) estimation. At high SNR levels, the MT is well estimated and hence the perceptual techniques operate well. In the case of babble noise, the PPP is no longer the best one. This can be explained by the fact that the structure of babble noise is very close to the structure of speech. The PPP, which is based on the detection of musical tones, therefore risks detecting musical tones wrongly and eliminating them.
Distortion evaluation
The least distorted signal is the noisy one. This is a logical result, since noisy speech consists of the intact speech plus the background noise, so nothing has degraded the speech signal itself. When the noisy speech is enhanced, the lowest level of distortion is obtained with spectral subtraction and statistical methods. This agrees with listening tests: speech is not highly distorted, but the amount of residual noise, especially musical noise, is very high.
Residual noise evaluation
The best reduction of the background noise is obtained with perceptual techniques. The remarks made for the overall quality evaluation also apply to the residual noise evaluation. This can be interpreted as follows: the impact of residual noise on the overall quality evaluation is greater than the impact of speech distortion.
7. Conclusion
In this paper, we have tested several denoising techniques, with and without perceptual tools. Our purpose was to find which method is the most adequate for each of three kinds of degradation: overall quality, speech distortion and residual noise. We found that spectral subtraction and statistical methods are the most adequate in terms of low speech distortion. However, with regard to overall quality and residual noise, perceptual techniques perform better than the others.
8. References
[1] P. C. Loizou, Speech enhancement: theory and practice, Prentice Hall, Englewood Cliffs, 1988.
[2] S. F. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 113-120, 1979.
[3] Y. Ephraim, D. Malah, Speech enhancement using a mean square error short-time spectral amplitude estimator, IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, pp. 1109-1121, 1984.
[4] S. Gustafsson, R. Martin, P. Jax and P. Vary, A psychoacoustic approach to combined acoustic echo cancellation and noise reduction, IEEE Trans. Speech and Audio Processing, vol. 10, no. 5, pp. 245-256, 2002.
[5] A. Ben Aicha, S. Ben Jebara, Perceptual musical noise reduction using critical bands tonality coefficients and masking thresholds, in Proc. INTERSPEECH, Antwerp, Belgium, August 2007.
[6] Y. Hu, P. Loizou, Incorporating a psychoacoustical model in frequency domain speech enhancement, IEEE Signal Processing Letters, pp. 270-273, 2004.
[7] ITU-T P.800, Subjective assessment methods of the transmission quality, ITU-T Recommendation P.800, 1996.
[8] ITU-T P.862, Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, ITU-T Recommendation P.862, 2000.
[9] Y. Hu and P. C. Loizou, Evaluation of objective measures for speech enhancement, in Proc. International Conference on Spoken Language Processing ICSLP, USA, 2006.
[10] W. Yang, M. Benbouchta and R. Yantorno, Performance of the modified bark spectral distortion as an objective speech measure, in Proc. Int. Conf. on Acoustics, Speech and Signal Processing ICASSP, vol. 1, pp. 541-544, 1998.
[11] ITU-T P.835, Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm, ITU-T Recommendation P.835, 2003.
[12] A. Ben Aicha and S. Ben Jebara, Quantitative Perceptual Separation of Two Kinds of Degradation in Speech Denoising Applications, in Advances in Nonlinear Speech Processing, pp. 230-245, Springer Berlin, December 2007.
[13] A. Ben Aicha and S. Ben Jebara, Speech perceptual quality measures separating speech distortion and additive noise degradations, submitted for publication at Speech Communications Journal, June 2009.
Table 3: Overall quality evaluation
(a) White noise
SNR           0 dB    10 dB                          20 dB
              COVL    PSANDR  MBSD   PESQ   COVL     PESQ   COVL
noisy speech  0.07    1.85    1.07   1.67   0.84     2.45   1.86
SS            0.22    2.68    0.56   2.18   1.24     2.84   2.16
SS+PPP        0.63    2.87    0.35   2.39   1.58     3.01   2.40
Wiener        0.85    4.74    0.11   2.50   1.56     2.95   2.15
W+PPP         1.08    4.94    0.07   2.55   1.75     2.97   2.32
Gustafsson    0.59    2.91    0.23   2.30   1.65     3.09   2.69
Loizou        0.45    4.32    0.07   2.25   1.44     3.03   2.49
(b) Babble noise
SNR           0 dB    10 dB                          20 dB
              COVL    PSANDR  MBSD   PESQ   COVL     COVL
noisy speech  1.13    1.79    0.78   2.19   1.89     2.70
SS            0.76    2.66    0.56   2.30   1.67     2.52
SS+PPP        0.92    2.66    0.39   2.33   1.80     2.64
Wiener        0.27    4.05    0.21   2.31   1.42     2.37
W+PPP         0.63    4.12    0.13   2.356  1.62     2.55
Gustafsson    1.31    2.89    0.19   2.69   2.33     3.02
Loizou        0.34    4.77    0.10   2.12   1.63     2.83
0 dB, MBSD and PESQ (column pairs in source order)
              MBSD   PESQ     MBSD   PESQ
noisy speech  1.65   1.65     2.30   1.26
SS            1.20   1.71     1.26   1.45
SS+PPP        0.84   1.76     0.76   1.67
Wiener        0.45   1.58     0.22   1.96
W+PPP         0.29   1.68     0.14   2.02
Gustafsson    0.42   1.81     0.49   1.44
Loizou        0.27   0.91     0.29   0.84

Table 4: Distortion evaluation
White noise
SNR           0 dB             10 dB            20 dB
              CSIG    PSADR    CSIG    PSADR    CSIG    PSADR
noisy speech  -1.00   16.94    0.06    18.89    1.28    20.50
SS            -0.74   14.67    0.48    14.35    1.58    14.91
SS+PPP        -0.19   13.73    0.92    13.17    1.86    14.21
Wiener         0.02    7.00    0.83     8.43    1.49    11.19
W+PPP          0.33    6.44    1.09     8.05    1.74    10.79
Gustafsson    -0.15   10.52    1.04    14.51    2.30    14.81
Loizou        -1.64    5.25    0.69     7.59    1.95    13.88
Babble noise
SNR           0 dB             10 dB            20 dB
              CSIG    PSADR    CSIG    PSADR    CSIG    PSADR
noisy speech   0.80   18.61    1.71    19.34    2.58    20.47
SS             0.24   14.78    1.35    15.00    2.27    15.15
SS+PPP         0.45   14.54    1.52    14.05    2.42    14.26
Wiener        -0.42    7.05    0.93     8.09    2.00    10.09
W+PPP          0.02    6.58    1.18     7.87    2.26     9.99
Gustafsson     0.99   10.81    2.08    10.55    2.88    14.11
Loizou        -0.03    6.18    1.25     8.67    2.52    13.46

Table 5: Residual noise evaluation
White noise
SNR           0 dB             10 dB            20 dB
              CBAK    PSANR    CBAK    PSANR    CBAK    PSANR
noisy speech  1.64    -3.73    2.26     1.87    3.14     8.92
SS            1.70     2.39    2.50     7.89    3.30    13.99
SS+PPP        1.94     3.79    2.70     9.48    3.46    15.06
Wiener        2.04    17.57    2.64    19.83    3.31    23.74
W+PPP         2.22    19.02    2.83    21.15    3.43    24.96
Gustafsson    2.04     4.02    2.77     9.16    3.52    19.12
Loizou        1.62    14.39    2.65    19.25    3.57    21.71
Babble noise
SNR           0 dB             10 dB            20 dB
              CBAK    PSANR    CBAK    PSANR    CBAK    PSANR
noisy speech  1.72    -4.14    2.45     1.73    3.30     9.50
SS            1.46     2.27    2.32     7.02    3.18    13.07
SS+PPP        1.57     2.26    2.41     7.17    3.28    13.44
Wiener        1.22    12.44    2.15    14.62    3.09    21.27
W+PPP         1.53    12.90    2.36    15.21    3.26    21.54
Gustafsson    1.95     3.88    2.79    11.52    3.55    14.04
Loizou        1.54    17.78    2.47    15.68    3.55    15.66