
SOME REFLEXIONS AND RESULTS ABOUT ADEQUATE SPEECH DENOISING TECHNIQUE ACCORDING TO DESIRED QUALITY

Anis Ben Aicha and Sofia Ben Jebara
Unité de recherche TECHTRA, École Supérieure des Communications de Tunis, 2083 Cité El-Ghazala/Ariana, TUNISIA
anis ben aicha@yahoo.fr,sofia.benjebara@supcom.rnu.tn

Abstract
This paper deals with the importance of identifying and selecting the adequate speech denoising technique for a considered application. The selection depends on the desired quality. Three types of degradation are considered: the overall degradation, the speech distortion and the residual background noise. Several popular speech enhancement techniques, using or not using perceptual tools, are considered. Different quality evaluation criteria, well correlated with subjective tests and ITU-T recommendations, are used. The analysis carried out in this paper shows that, in terms of overall quality and residual noise evaluation, perceptual techniques lead to the best performance. In terms of speech distortion, the best performance is obtained with spectral subtraction and statistical techniques.
Key words: speech enhancement, evaluation criteria, different degradations.

1. Introduction

The problem of enhancing speech degraded by background noise is a research topic that has received a great deal of attention over the past few decades. Many techniques have been proposed over the years (see for example [1] for an overview). They can be broadly divided into several categories according to their design principle and the tools they use: spectral subtraction based techniques, statistical model based techniques, subspace techniques, perceptual based techniques, etc. For any speech communication application where speech denoising is indispensable, such as hands-free communications, voice over IP, hearing aids, answering machines, teleconferencing systems, car and mobile phones, cockpits and noisy manufacturing environments, a preliminary deep reflection must be carried out to answer an unavoidable question: which speech denoising technique should be used? To answer it, we should also decide which kind of speech quality we want to obtain. Classically, users look for a good overall listening quality, which can be measured using many objective speech quality measures. These criteria evaluate the overall quality of denoised speech by a single score which embeds all kinds of degradation (speech distortion, residual background noise, musical noise, clipping, etc.). Nowadays, novel tendencies in quality evaluation are oriented towards a more precise judgement of the type of perceived degradation. In fact, during listening tests, some applications tolerate a lowered background noise level, while others tolerate slight distortions caused by the denoising process. In this paper, we aim at comparing several speech denoising approaches. The purpose is to find the best denoising technique which reduces one selected kind of degradation. To reach this objective, several overall measures and other measures separating the degradations are used. We focus on objective criteria recently introduced to mimic listening tests, which are naturally able to identify the kind of degradation. This paper is organized as follows. Section 2 defines the different kinds of degradation affecting denoised speech. In Section 3, we give a brief overview of the tested denoising techniques. Sections 4 and 5 are devoted to the criteria used for overall quality, speech distortion and residual noise evaluation. In Section 6, we discuss the experimental results.

2. Speech denoising problem and degradations definition

2.1. Background

Let the corrupted speech signal y(t) be modelled as

y(t) = s(t) + n(t),    (1)

where s(t) is the clean speech signal and n(t) is the noise signal; they are assumed to be uncorrelated. Due to the short-time stationarity property of speech, the processing is done on a frame-by-frame basis. The Short Time Fourier Transform (STFT) is used and the previous model is rewritten as

Y(m, k) = S(m, k) + N(m, k),    (2)

where m (resp. k) denotes the frame index (resp. the frequency index). Speech denoising aims at finding an estimate of the short-time spectral amplitude of the original speech, denoted here |Ŝ(m, k)|. Classically, it is obtained using a denoising filter H(m, k):

|Ŝ(m, k)| = H(m, k) |Y(m, k)|.    (3)
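As an illustration, a minimal sketch (in Python/NumPy) of this frame-by-frame processing is given below; the gain function gain_fn is a hypothetical placeholder standing for any of the filter designs discussed in Section 3, and the noisy phase is reused for reconstruction.

import numpy as np

def denoise_stft(y, gain_fn=lambda mag: np.ones_like(mag), frame_len=256, hop=128):
    # Frame-by-frame processing of Eq. (2)-(3): apply a real gain to |Y(m,k)|,
    # keep the noisy phase, and reconstruct the time signal by overlap-add.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    out = np.zeros(len(y))
    for m in range(n_frames):
        start = m * hop
        frame = y[start:start + frame_len] * window
        Y = np.fft.rfft(frame)                              # Y(m, k)
        H = gain_fn(np.abs(Y))                              # H(m, k), filter design left open
        S_hat = H * np.abs(Y) * np.exp(1j * np.angle(Y))    # |S_hat| = H |Y|, noisy phase kept
        out[start:start + frame_len] += np.fft.irfft(S_hat, n=frame_len) * window
    return out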

Various considerations are taken into account to design H(m, k): for example, the relative amount of noise measured in terms of signal to noise ratio, the estimated noise spectral amplitude, only the audible part of the noise, the masking of the audible residual noise, etc. Once the filter design criterion is defined, the filter can be expressed and determined. For each defined filter, different kinds of degradation are inevitably introduced, as explained in the next subsection.

2.2. Degradation separation

Generally, speech and noise are assumed uncorrelated. Thus, the Power Spectrum Density (PSD) of the error between clean and denoised speech, Φ_ε(m, k), is given by

Φ_ε(m, k) = [H(m, k) − 1]² Φ_S(m, k) + H(m, k)² Φ_N(m, k),    (4)

where Φ_S(m, k) (resp. Φ_N(m, k)) denotes the speech PSD (resp. the noise PSD). The quantity Φ_ε(m, k) represents the overall degradation. Its first term expresses the attenuation of the clean speech frequency components: since H(m, k) is used to reduce the amount of noise in the observed signal, its amplitude is lower than one. Consequently, this degradation is perceptually heard as clean speech distortion. The second term of Eq. (4) expresses the residual noise. It can be perceptually heard as a background noise or as a musical noise. Since it is additive, it can be regarded as an accentuation of some speech frequency components.
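For illustration, assuming the gain H and the PSDs Φ_S and Φ_N are available for one frame (hypothetical per-frequency arrays below), the two terms of Eq. (4) can be evaluated separately as in the following sketch.

import numpy as np

def error_psd_terms(H, phi_s, phi_n):
    # Two terms of Eq. (4): speech distortion (clean components attenuated by H)
    # and residual noise (noise passed through H).
    speech_distortion = (H - 1.0) ** 2 * phi_s
    residual_noise = H ** 2 * phi_n
    return speech_distortion, residual_noise

# toy usage with hypothetical per-frequency values for one frame
H = np.array([0.9, 0.5, 0.2])
phi_s = np.array([1.0, 0.8, 0.1])
phi_n = np.array([0.05, 0.2, 0.3])
distortion, residual = error_psd_terms(H, phi_s, phi_n)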

3. Tested denoising techniques overview


The spectral subtraction method is a well-known noise reduction technique. It estimates the power spectrum of clean speech by explicitly subtracting the noise power spectrum from the noisy speech power spectrum. Traditional voice activity detectors track the noise-only frames of the noisy speech to update the noise power spectrum. The first version was developed by Boll in 1979 [2]. Since then, there have been many variations in an effort to improve quality. As a popular statistical model based technique, we cite the well-known Ephraim and Malah approach, which derives a minimum mean-square error (MMSE) short-time spectral amplitude estimator under some assumptions about the statistical models of the discrete Fourier coefficients [3]. The frequency domain Wiener filter is also popular thanks to its simplicity of implementation [1]. More recently, perceptual based techniques have been introduced to overcome the classic trade-off between noise reduction and speech distortion. Many perceptual speech enhancement algorithms have been proposed. They are based on psychoacoustic models in order to take advantage of human auditory properties. Mainly, the masking phenomenon is exploited; it is defined as the ability of the auditory system to merge, in a perceptual sense, two signals close in the time and frequency domains into a single signal. Some techniques focus on removing only the audible background noise [4], others operate as post-processing techniques to reduce musical noise, in a perceptual sense, after classic techniques such as spectral subtraction and Wiener filtering [5], while other methods minimize the speech distortion under the constraint of inaudible residual noise [6]. In this paper, the following techniques are tested: spectral subtraction (denoted SS), Wiener filtering (denoted Wiener), the previous techniques with perceptual post-processing (denoted PPP), the Gustafsson approach (denoted Gustafsson) and the perceptual approach developed by Hu and Loizou (denoted Loizou).
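As a rough illustration of the first two families (basic textbook forms, not the exact variants evaluated in this paper), the magnitude-domain spectral subtraction gain and the frequency-domain Wiener gain can be sketched as follows, assuming a noise power spectrum estimate obtained from noise-only frames.

import numpy as np

def spectral_subtraction_gain(noisy_power, noise_power, floor=0.01):
    # Power spectral subtraction: remove the estimated noise power and keep a
    # small spectral floor to limit musical noise.
    clean_power = np.maximum(noisy_power - noise_power, floor * noisy_power)
    return np.sqrt(clean_power / noisy_power)

def wiener_gain(noisy_power, noise_power, eps=1e-12):
    # Frequency-domain Wiener filter with the a priori SNR approximated by
    # max(a posteriori SNR - 1, 0).
    snr_prior = np.maximum(noisy_power / (noise_power + eps) - 1.0, 0.0)
    return snr_prior / (snr_prior + 1.0)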

4. Overall degradation evaluation

4.1. Criteria overview

Listening tests are the best way to judge the auditory quality of speech sequences. Human listeners give their opinions about the speech quality and the average of the listeners' scores is the Mean Opinion Score (MOS) [7]. Since listening tests are very expensive and time consuming, many objective criteria, well correlated with MOS, have been proposed to estimate speech quality at low cost. In this paper, we used the Modified Bark Spectral Distortion (MBSD) [10], the Perceptual Evaluation of Speech Quality (PESQ) [8], the recent composite criterion COVL [9] and the Perceptual Signal to Audible Noise and Distortion Ratio (PSANDR) [12]. The correlation coefficients between these criteria and MOS are presented in Tab. 1 [13].

Table 1: Correlation coefficient between overall objective criteria and the subjective MOS score.
        MBSD   PESQ   COVL   PSANDR
MOS     0.51   0.67   0.68   0.78

4.2. Simulation results

In our simulations, we used a clean speech sentence extracted from the TIMIT database ("She had your dark suit in greasy wash water all year"), sampled at 8 kHz. It was artificially corrupted with white Gaussian noise and babble noise, at different values of the Signal to Noise Ratio (denoted SNR). The mentioned denoising techniques are applied and the quality is evaluated using the overall criteria cited above. The results are summarized in Tab. 3.
5. Separate degradations evaluation

5.1. Perceptual measures separating additive noise and speech distortion

The recent ITU-T recommendation P.835 [11] was designed to reduce the listeners' uncertainty about the nature of the degradation components (speech distortion, background noise or both of them). Hence, besides MOS, two scales are defined to rate respectively the speech content (labelled SIG) and the background content (labelled BAK). To estimate them quickly and at low cost, objective measures should be introduced. At the moment, there is no standard method to separate objectively speech distortion and residual noise. To our knowledge, few works have dealt with objective criteria separating the degradations. A first attempt builds on the objective criteria best correlated with subjective measures and combines them linearly to obtain composite criteria. The most recent ones are CSIG, measuring signal degradation, and CBAK, measuring background noise [9]. A second attempt uses the masking concept to measure only the audible parts of the speech signal and the audible parts of the degradation. The resulting measures are called respectively the Perceptual Signal to Audible Noise Ratio (PSANR) for the additive noise and the Perceptual Signal to Audible Distortion Ratio (PSADR) for the speech distortion [12]. The correlation coefficients of these criteria with the P.835 subjective scores are given in Tab. 2 [13].

Table 2: Correlation coefficient between separate degradation measures and their related subjective scores.
        CSIG   PSADR   CBAK   PSANR
SIG     0.48   0.44     -      -
BAK      -      -      0.72   0.75
6. Results

In terms of performance, spectral subtraction is computationally efficient and has a simple mechanism to control the trade-off between speech distortion and residual noise, but it suffers from a notorious artifact known as musical noise. The statistical model based techniques reduce musical noise but do not eliminate it completely. They have a moderate computational load, but have no mechanism to control the trade-off between speech distortion and residual noise. The perceptual based techniques have a greater complexity; they considerably improve the intelligibility of speech. The trade-off between speech distortion and residual noise is imposed and no mechanism to control it is designed.
The experimental results of the objective evaluation related to the different kinds of degradation (overall quality, speech distortion and residual noise) are summarized respectively in Table 3, Table 4 and Table 5. These tables permit the following interpretations.

Overall quality evaluation. The best results are obtained with the techniques which use perceptual tools, independently of the noise nature (white or babble). This can be explained by the fact that these techniques make a trade-off between speech distortion and residual noise, which improves the overall quality of the denoised speech. In the case of white noise, the PPP leads to better performance, especially at low SNR. This can be explained by the fact that, in this case, musical tones are numerous and the PPP succeeds in detecting and eliminating them. However, at high SNR levels, the performances of the perceptual techniques are almost equal. Indeed, the performance of these techniques depends greatly on the accuracy of the masking threshold (MT) estimation. At high SNR levels, the MT is well estimated and hence the perceptual techniques operate well. In the case of babble noise, the PPP is no longer the best one. This can be explained by the fact that the babble noise structure is very close to the speech structure. So the PPP, which is based on the detection of musical tones, risks wrongly detecting and eliminating musical tones.

Distortion evaluation. The least distorted signal is the noisy one. This is a logical result, since noisy speech is constituted of the intact version of the speech plus the background noise, so there is no reason for the speech signal to be degraded. When the noisy signal is enhanced, the lowest level of distortion is obtained with spectral subtraction and the statistical methods. This agrees with listening tests: speech is not highly distorted, but the amount of residual noise, especially musical noise, is very high.

Residual noise evaluation. The best reduction of the background noise is obtained with the perceptual techniques. The same remarks made for the overall quality evaluation hold for the residual noise evaluation. This can be interpreted as follows: the impact of the residual noise on the overall quality is more important than the impact of the speech distortion.

7. Conclusion

In this paper, we have tested several denoising techniques which use or do not use perceptual tools. Our purpose was to find which method is the most adequate for each of three kinds of degradation: overall degradation, speech distortion and residual noise. We found that spectral subtraction and the statistical methods are the most adequate in terms of low speech distortion. However, regarding the overall quality and the residual noise, the perceptual techniques perform better than the others.

8. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, Boca Raton, 2007.
[2] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction", IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 113-120, 1979.
[3] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator", IEEE Trans. Acoust., Speech, Signal Processing, vol. 32, pp. 1109-1121, 1984.
[4] S. Gustafsson, R. Martin, P. Jax and P. Vary, "A psychoacoustic approach to combined acoustic echo cancellation and noise reduction", IEEE Trans. Speech and Audio Processing, vol. 10, no. 5, pp. 245-256, 2002.
[5] A. Ben Aicha and S. Ben Jebara, "Perceptual musical noise reduction using critical bands tonality coefficients and masking thresholds", in Proc. INTERSPEECH, Antwerp, Belgium, August 2007.
[6] Y. Hu and P. Loizou, "Incorporating a psychoacoustical model in frequency domain speech enhancement", IEEE Signal Processing Letters, pp. 270-273, 2004.
[7] ITU-T P.800, "Subjective assessment methods of transmission quality", ITU-T Recommendation P.800, 1996.
[8] ITU-T P.862, "Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs", ITU-T Recommendation P.862, 2000.
[9] Y. Hu and P. C. Loizou, "Evaluation of objective measures for speech enhancement", in Proc. International Conference on Spoken Language Processing (ICSLP), USA, 2006.
[10] W. Yang, M. Benbouchta and R. Yantorno, "Performance of the modified bark spectral distortion as an objective speech measure", in Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 541-544, 1998.
[11] ITU-T P.835, "Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm", ITU-T Recommendation P.835, 2003.
[12] A. Ben Aicha and S. Ben Jebara, "Quantitative perceptual separation of two kinds of degradation in speech denoising applications", in Advances in Nonlinear Speech Processing, pp. 230-245, Springer, Berlin, December 2007.
[13] A. Ben Aicha and S. Ben Jebara, "Speech perceptual quality measures separating speech distortion and additive noise degradations", submitted for publication to the Speech Communication journal, June 2009.

Table 3: Overall quality evaluation
(a) White noise
                        0 dB                        10 dB                        20 dB
              MBSD  PESQ  COVL  PSANDR    MBSD  PESQ  COVL  PSANDR    MBSD  PESQ  COVL  PSANDR
noisy speech  1.65  1.65  0.07   1.85     1.07  1.67  0.84   2.59     0.48  2.45  1.86   3.53
SS            1.20  1.71  0.22   2.68     0.56  2.18  1.24   3.41     0.26  2.84  2.16   4.23
SS+PPP        0.84  1.76  0.63   2.87     0.35  2.39  1.58   3.63     0.16  3.01  2.40   4.37
Wiener        0.45  1.58  0.85   4.74     0.11  2.50  1.56   5.04     0.05  2.95  2.15   5.55
W+PPP         0.29  1.68  1.08   4.94     0.07  2.55  1.75   5.21     0.03  2.97  2.32   5.71
Gustafsson    0.42  1.81  0.59   2.91     0.23  2.30  1.65   3.58     0.11  3.09  2.69   4.92
Loizou        0.27  0.91  0.45   4.32     0.07  2.25  1.44   4.96     0.03  3.03  2.49   5.27

(b) Babble noise
                        0 dB                        10 dB                        20 dB
              MBSD  PESQ  COVL  PSANDR    MBSD  PESQ  COVL  PSANDR    MBSD  PESQ  COVL  PSANDR
noisy speech  2.30  1.26  1.13   1.79     0.78  2.19  1.89   2.57     0.36  2.87  2.70   3.61
SS            1.26  1.45  0.76   2.66     0.56  2.30  1.67   3.29     0.26  2.97  2.52   4.10
SS+PPP        0.76  1.67  0.92   2.66     0.39  2.33  1.80   3.32     0.18  3.00  2.64   4.16
Wiener        0.22  1.96  0.27   4.05     0.21  2.31  1.42   4.34     0.10  2.99  2.37   5.22
W+PPP         0.14  2.02  0.63   4.12     0.13  2.356 1.62   4.42     0.06  3.01  2.55   5.26
Gustafsson    0.49  1.44  1.31   2.89     0.19  2.69  2.33   3.92     0.09  3.20  3.02   4.24
Loizou        0.29  0.84  0.34   4.77     0.10  2.12  1.63   4.48     0.07  3.19  2.83   4.46

Table 4: Distortion evaluation
                         White noise                            Babble noise
                 0 dB        10 dB        20 dB         0 dB        10 dB        20 dB
              CSIG PSADR   CSIG PSADR   CSIG PSADR    CSIG PSADR   CSIG PSADR   CSIG PSADR
noisy speech  -1.00 16.94   0.06 18.89   1.28 20.50    0.80 18.61   1.71 19.34   2.58 20.47
SS            -0.74 14.67   0.48 14.35   1.58 14.91    0.24 14.78   1.35 15.00   2.27 15.15
SS+PPP        -0.19 13.73   0.92 13.17   1.86 14.21    0.45 14.54   1.52 14.05   2.42 14.26
Wiener         0.02  7.00   0.83  8.43   1.49 11.19   -0.42  7.05   0.93  8.09   2.00 10.09
W+PPP          0.33  6.44   1.09  8.05   1.74 10.79    0.02  6.58   1.18  7.87   2.26  9.99
Gustafsson    -0.15 10.52   1.04 14.51   2.30 14.81    0.99 10.81   2.08 10.55   2.88 14.11
Loizou        -1.64  5.25   0.69  7.59   1.95 13.88   -0.03  6.18   1.25  8.67   2.52 13.46

Table 5: Residual noise evaluation
                         White noise                            Babble noise
                 0 dB        10 dB        20 dB         0 dB        10 dB        20 dB
              CBAK PSANR   CBAK PSANR   CBAK PSANR    CBAK PSANR   CBAK PSANR   CBAK PSANR
noisy speech   1.64 -3.73   2.26  1.87   3.14  8.92    1.72 -4.14   2.45  1.73   3.30  9.50
SS             1.70  2.39   2.50  7.89   3.30 13.99    1.46  2.27   2.32  7.02   3.18 13.07
SS+PPP         1.94  3.79   2.70  9.48   3.46 15.06    1.57  2.26   2.41  7.17   3.28 13.44
Wiener         2.04 17.57   2.64 19.83   3.31 23.74    1.22 12.44   2.15 14.62   3.09 21.27
W+PPP          2.22 19.02   2.83 21.15   3.43 24.96    1.53 12.90   2.36 15.21   3.26 21.54
Gustafsson     2.04  4.02   2.77  9.16   3.52 19.12    1.95  3.88   2.79 11.52   3.55 14.04
Loizou         1.62 14.39   2.65 19.25   3.57 21.71    1.54 17.78   2.47 15.68   3.55 15.66
