
Proceedings of 13th International Conference on Computer and Information Technology (ICCIT 2010)

23-25 December, 2010, Dhaka, Bangladesh

Performance Evaluation of MLPC and MFCC for HMM based Noisy Speech Recognition

Mizanur Rahman, Md. Babul Islam
Dept. of Computer Science and Engineering, Islamic University, Kushtia, Bangladesh

Dept. of Applied Physics and Electronic Engineering, Rajshahi University, Rajshahi, Bangladesh
s.feraree@yahoo.com, babul.apee@ru.ac.bd

Abstract

In this paper the auditory-like features MLPC and MFCC have been used as front-ends, and their performance has been evaluated on the Aurora-2 database for Hidden Markov Model (HMM) based noisy speech recognition. The clean data set is used for training and test set A is used to examine the performance. It has been found that almost the same recognition performance is obtained for MLPC and MFCC: the average word accuracy for MLPC and for MFCC is found to be 59.05% and 59.21%, respectively. It has also been observed that MLPC is more effective than MFCC for the subway and exhibition noise types; on the other hand, MFCC is superior for the babble and car noises.

Keywords: MLPC, MFCC, HMM, Bilinear transformation, Noisy speech recognition.

I. INTRODUCTION

Designing a front-end incorporating auditory-like frequency resolution improves recognition accuracy [1]-[3]. Therefore, we need to parameterize the perceptually relevant aspects of short-term speech spectra and their dynamics in the ASR front-end, in order to enhance the performance of Automatic Speech Recognition (ASR).

In nonparametric spectral analysis, the Mel-Frequency Cepstral Coefficient (MFCC) [1] is one of the most popular spectral features in ASR. This parameter takes account of nonlinear frequency resolution like that of the human ear.

In parametric spectral analysis, linear prediction coding (LPC) analysis [4], [5] based on an all-pole model is widely used because of its computational simplicity and efficiency. While the all-pole model enhances the formant peaks, in line with auditory perception, other perceptually relevant characteristics are not incorporated into the model, unlike MFCC. To alleviate this inconsistency between LPC and auditory analysis, several auditory spectra have been simulated before the all-pole modeling [2], [6]-[8].

In contrast to these spectral modifications, Strube [9] proposed all-pole modeling of a frequency-warped signal, which is mapped onto a warped frequency scale by means of the bilinear transformation [10], and investigated several computational procedures. However, the methods proposed in [9] to estimate the warped all-pole model have rarely been used in automatic speech recognition. Recently, as an LP-based method, a simple and efficient time-domain technique to estimate the all-pole model on the mel-frequency scale was proposed in [11], which is referred to as "Mel-LPC" analysis. The prediction coefficients are estimated without any approximation by minimizing the prediction error power, at a two-fold computational cost over the standard LPC analysis.

In this paper an automatic speech recognition (ASR) system is developed. Both MFCC and MLPC are used as front-end features, and the effectiveness of these features per noise category is evaluated for HMM based noisy speech recognition.

II. MEL-LP ANALYSIS

The frequency-warped signal x̃[n] (n = 0, 1, ..., ∞) obtained by the bilinear transformation [10] of a finite-length windowed signal x[n] (n = 0, 1, ..., N - 1) is defined by

\tilde{X}(\tilde{z}) = \sum_{n=0}^{\infty} \tilde{x}[n] \, \tilde{z}^{-n} = X(z) = \sum_{n=0}^{N-1} x[n] \, z^{-n}    (1)

where z̃⁻¹ is the first-order all-pass filter

\tilde{z}^{-1} = \frac{z^{-1} - \alpha}{1 - \alpha z^{-1}}    (2)

The phase response of z̃⁻¹ is given by

\tilde{\lambda} = \lambda + 2 \tan^{-1} \left\{ \frac{\alpha \sin \lambda}{1 - \alpha \cos \lambda} \right\}    (3)

This phase function determines the frequency mapping. Now, the all-pole model on the warped frequency scale is defined as

\tilde{H}_a(\tilde{z}) = \frac{\tilde{\sigma}_e}{1 + \sum_{k=1}^{p} \tilde{a}_k \, \tilde{z}^{-k}}    (4)

where ã_k is the k-th mel-prediction coefficient and σ̃_e² is the residual energy [9].
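To make the warping concrete, here is a minimal sketch (not from the paper; it assumes NumPy, and the helper name warped_frequency is hypothetical) that evaluates the phase mapping of (3). With 8 kHz sampling, as in the Aurora-2 experiments of Section IV, the paper's warping factor α = 0.35 stretches low frequencies and compresses high ones, roughly following the mel scale of (11):

```python
import numpy as np

def warped_frequency(lam, alpha):
    """Phase response of the first-order all-pass filter (Eq. 3):
    maps normalized frequency lam (rad) to the warped axis."""
    return lam + 2.0 * np.arctan(alpha * np.sin(lam) / (1.0 - alpha * np.cos(lam)))

# Example: fs = 8 kHz, alpha = 0.35 (the value used in Section IV).
fs = 8000.0
f = np.linspace(0.0, fs / 2.0, 9)            # linear frequencies in Hz
lam = 2.0 * np.pi * f / fs                   # normalized angular frequency
lam_w = warped_frequency(lam, alpha=0.35)
for fi, li, lw in zip(f, lam, lam_w):
    print(f"{fi:7.0f} Hz -> {lw / np.pi:5.3f}*pi (linear: {li / np.pi:5.3f}*pi)")
```

The mapping fixes the endpoints 0 and π and allocates more of the warped axis to low frequencies, which is what gives the all-pole model of (4) its auditory-like resolution.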
On the basis of the minimum prediction error energy for x̃[n] over the infinite time span, ã_k and σ̃_e are obtained by Durbin's algorithm from the autocorrelation coefficients r̃[m] of x̃[n], defined by

\tilde{r}[m] = \sum_{n=0}^{\infty} \tilde{x}[n] \, \tilde{x}[n-m]    (5)

which is referred to as the mel-autocorrelation function.
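Durbin's algorithm itself is not spelled out in the paper; as a reference, here is a minimal Levinson-Durbin sketch (plain NumPy, hypothetical function name) that recovers the prediction coefficients and the residual energy from an autocorrelation sequence such as r̃[m]:

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: solve the normal equations for the
    prediction coefficients a[1..p] and the residual energy, given
    autocorrelation r[0..p]. Convention: A(z) = 1 + sum_k a_k z^-k,
    so the model of Eq. (4) is sqrt(err) / A(z)."""
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient from the current prediction error
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] += k * a[i - 1:0:-1]      # update lower-order coefficients
        a[i] = k
        err *= (1.0 - k * k)             # shrink the residual energy
    return a[1:], err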
The mel-autocorrelation coefficients can easily be calculated from the input speech signal x[n] via the following two steps [11], [12]. First, the generalized autocorrelation coefficients are calculated as

\tilde{r}_\alpha[m] = \sum_{n=0}^{N-1} x[n] \, x_m[n]    (6)

where x_m[n] is the output signal of an m-th order all-pass filter z̃⁻ᵐ excited by x_0[n] = x[n]. That is, r̃_α[m] is defined by replacing the unit delay z⁻¹ with the first-order all-pass filter z̃(z)⁻¹ in the definition of the conventional autocorrelation function, as shown in Fig. 1. Due to the frequency warping, r̃_α[m] includes the frequency weighting W̃(e^{jλ̃}) defined by

\tilde{W}(\tilde{z}) = \frac{\sqrt{1 - \alpha^2}}{1 + \alpha \tilde{z}^{-1}}    (7)

which is derived from

\frac{d\lambda}{d\tilde{\lambda}} = \left| \tilde{W}(e^{j\tilde{\lambda}}) \right|^2    (8)

Thus, in the second step, the weighting is removed by inverse filtering in the autocorrelation domain using \{ \tilde{W}(\tilde{z}) \tilde{W}(\tilde{z}^{-1}) \}^{-1}.
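The first step can be transcribed almost literally from (6); the sketch below (not from the paper; it assumes NumPy and SciPy's lfilter, and the function name is hypothetical) builds x_m[n] by cascading first-order all-pass sections. The second step, removing the weighting of (7) by inverse filtering in the autocorrelation domain, is omitted for brevity:

```python
import numpy as np
from scipy.signal import lfilter

def mel_autocorr(x, order, alpha=0.35):
    """Generalized autocorrelation (Eq. 6): r_alpha[m] = sum_n x[n] * x_m[n],
    where x_m is x passed through m cascaded all-pass sections (Eq. 2)."""
    r = np.zeros(order + 1)
    xm = x.copy()                          # x_0[n] = x[n]
    r[0] = np.dot(x, xm)
    for m in range(1, order + 1):
        # one more first-order all-pass stage: (z^-1 - alpha) / (1 - alpha z^-1)
        xm = lfilter([-alpha, 1.0], [1.0, -alpha], xm)
        r[m] = np.dot(x, xm)               # summed over the N input samples
    return r
```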
As feature parameters for recognition, the Mel-LP cepstral coefficients can be expressed as

\log \tilde{H}_a(\tilde{z}) = \sum_{n=0}^{\infty} c_n \, \tilde{z}^{-n}    (9)

where {c_n} are the mel-cepstral coefficients. They can also be calculated directly from the mel-prediction coefficients {ã_k} [13] using the following recursion:

c_k = -\tilde{a}_k - \frac{1}{k} \sum_{j=1}^{k-1} (k - j) \, \tilde{a}_j \, c_{k-j}    (10)

It should be noted that the number of cepstral coefficients need not be the same as the number of prediction coefficients.
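The recursion of (10) translates directly into code; a minimal sketch (not from the paper; hypothetical name), taking ã_k = 0 for k greater than the model order p so the recursion can run past p:

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Convert prediction coefficients a[0..p-1] (for 1 + sum a_k z^-k)
    to cepstral coefficients c[1..n_ceps] via the recursion of Eq. (10)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)                   # c[0] (gain term) unused here
    for k in range(1, n_ceps + 1):
        ak = a[k - 1] if k <= p else 0.0       # a_k vanishes beyond the order
        acc = sum((k - j) * a[j - 1] * c[k - j]
                  for j in range(1, min(k, p + 1)))
        c[k] = -ak - acc / k
    return c[1:]
```

Chaining the three sketches above, mel_autocorr, levinson_durbin and lpc_to_cepstrum, gives one plausible Mel-LP cepstral front-end under the stated sign conventions, up to the inverse weighting step omitted earlier.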

Fig. 1: Generalized autocorrelation function. (x[n] is passed through the cascaded all-pass filter z̃(z)⁻ᵐ to give x_m[n]; cross-correlation with x[n] yields r̃_α[m].)

III. FILTER-BANK BASED ANALYSIS

In filter-bank based systems, MFCC [1] is a widely used spectral feature. This parameter takes account of nonlinear frequency resolution like that of the human ear. The mel-scale filter-bank is illustrated in Fig. 2. As can be seen, the filters used are triangular and equally spaced along the mel scale, which is defined by

\mathrm{Mel}(f) = 2595 \log_{10} \left( 1 + \frac{f}{700} \right)    (11)

where f is the frequency in Hz. Usually, the triangular filters are spread over the whole frequency range from zero up to the Nyquist frequency.

To implement this filter-bank, first a Fourier transform is applied to the windowed speech signal and the magnitude is calculated. Each FFT magnitude coefficient is then multiplied by the corresponding filter gain and the results are accumulated. Thus, each bin holds a weighted sum representing the spectral magnitude in that filter-bank channel. As an alternative, the power rather than the magnitude of the FFT can be used in the binning process.

Finally, the Mel-Frequency Cepstral Coefficients (MFCCs) are calculated from the log filter-bank amplitudes {m_j} using the Discrete Cosine Transform (DCT) as follows:

c_i = \sqrt{\frac{2}{N}} \sum_{j=0}^{N-1} m_j \cos \left( \frac{\pi i}{N} (j + 0.5) \right)    (12)

where N is the number of filter-bank channels.

Fig. 2: Mel-scale filter-bank. (Triangular filters equally spaced in mel frequency; energy in each bin plotted against frequency.)
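As a sketch of the whole filter-bank analysis (not from the paper; NumPy only, all helper names hypothetical, FFT size 256 assumed), the following computes the mel scale of (11), builds the triangular filters, and applies the DCT of (12); the channel and cepstrum counts default to the values used in Section IV:

```python
import numpy as np

def mel(f):
    """Mel scale (Eq. 11)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=8000, n_chans=22, n_ceps=14, n_fft=256):
    """MFCCs for one windowed frame: FFT magnitude -> triangular
    mel filter-bank -> log -> DCT (Eq. 12)."""
    mag = np.abs(np.fft.rfft(frame, n_fft))
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)

    # Filter edges equally spaced on the mel scale, from 0 up to Nyquist.
    edges = inv_mel(np.linspace(0.0, mel(fs / 2.0), n_chans + 2))
    fbank = np.zeros(n_chans)
    for j in range(n_chans):
        lo, ctr, hi = edges[j], edges[j + 1], edges[j + 2]
        up = (freqs - lo) / (ctr - lo)          # rising slope
        down = (hi - freqs) / (hi - ctr)        # falling slope
        gain = np.clip(np.minimum(up, down), 0.0, None)
        fbank[j] = np.dot(gain, mag)            # weighted sum per channel

    m = np.log(np.maximum(fbank, 1e-10))        # log filter-bank amplitudes
    i = np.arange(n_ceps)[:, None]
    j = np.arange(n_chans)[None, :]
    dct = np.sqrt(2.0 / n_chans) * np.cos(np.pi * i * (j + 0.5) / n_chans)
    return dct @ m                              # c_0 .. c_{n_ceps-1}
```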


IV. EVALUATION ON AURORA-2 DATABASE

A. Experimental Setup

The proposed system was evaluated on the Aurora-2 database [14], which is a subset of the TI digits database contaminated by additive noises and channel effects. It should be noted that the whole Aurora-2 database was not used in this experiment; rather, a subset was used, as shown in Table I.

TABLE I: Definition of training and test data.

Data set       Noise Type                         SNR [dB]
Training       Clean                              ∞ (clean)
Test (set A)   Subway, Babble, Car, Exhibition    clean, 20, 15, 10, 5, 0, -5

The recognition experiments were conducted with a 12th-order Mel-LP analysis and with a filter-bank based analysis in which the number of filter-bank channels was 22. The preemphasized speech signal (preemphasis factor 0.95) was windowed using a Hamming window of length 20 ms with a 10 ms frame period. The frequency warping factor was set to 0.35. As the front-end, 14 cepstral coefficients and their delta coefficients, including the 0th terms, were used. Thus, the feature vector size is 28 for both the MLPC and MFCC front-ends.
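Put together, the framing and feature assembly described here might look like the following sketch (not from the paper; it reuses the hypothetical mfcc() above, and the ±2-frame delta regression window is an assumption, since the paper does not state it):

```python
import numpy as np

def features(speech, fs=8000, pre=0.95, frame_ms=20, hop_ms=10, n_ceps=14):
    """Preemphasis, 20 ms Hamming windows at a 10 ms frame period,
    14 static cepstra (incl. the 0th term) + 14 deltas -> 28-dim vectors."""
    x = np.append(speech[0], speech[1:] - pre * speech[:-1])   # preemphasis
    flen, hop = fs * frame_ms // 1000, fs * hop_ms // 1000
    win = np.hamming(flen)
    n = 1 + (len(x) - flen) // hop
    static = np.array([mfcc(win * x[t * hop : t * hop + flen], fs=fs,
                            n_ceps=n_ceps) for t in range(n)])
    # Deltas by linear regression over +/-2 neighbouring frames
    # (assumed width; the paper does not specify it).
    pad = np.pad(static, ((2, 2), (0, 0)), mode="edge")
    delta = sum(k * (pad[2 + k:2 + k + n] - pad[2 - k:2 - k + n])
                for k in (1, 2)) / (2 * (1 + 4))
    return np.hstack([static, delta])                          # 28 columns
```

For the MLPC front-end, the static coefficients would instead come from the Mel-LP chain of Section II, with the same delta and frame settings.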
The reference recognizer was based on HTK (the Hidden Markov Model Toolkit). The HMMs were trained on the clean condition with 16 states per word and a mixture of 3 Gaussians per state, using left-to-right models.
The recognition accuracy (Acc) is evaluated as follows:

\mathrm{Acc} = \frac{N - D - S - I}{N} \times 100\%    (13)

where N is the total number of words, and D, S and I are the numbers of deletion, substitution and insertion errors, respectively.
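As a quick worked example of (13), with hypothetical counts rather than values from the paper:

```python
def word_accuracy(n_words, n_del, n_sub, n_ins):
    """Word accuracy per Eq. (13); can go negative if insertions dominate."""
    return 100.0 * (n_words - n_del - n_sub - n_ins) / n_words

# Hypothetical counts for illustration (not from the paper):
print(word_accuracy(1000, 120, 230, 60))   # -> 59.0
```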
B. Recognition Results

The detailed recognition results are presented in this section for both the MLPC and MFCC front-ends. The recognition accuracies for MLPC and MFCC are listed in Table II and Table III, respectively. The average recognition accuracies for MLPC and MFCC are found to be 59.05% and 59.21%, respectively. Hence, on average, there is no significant difference in word accuracy between MLPC and MFCC.

From Table II and Table III, we have also found that MLPC is more effective than MFCC for the subway and exhibition noise types; on the other hand, MFCC is superior for the babble and car noises.

V. CONCLUSION

An HMM based automatic speech recognition (ASR) system has been developed to evaluate the performance of the auditory-like features MLPC and MFCC as front-ends. It has been found that for MFCC, the average word accuracy is 51.87% and 56.59% for the babble and car noises, respectively. In the case of MLPC, the average word accuracy is 48.06% for babble noise and 53.77% for car noise. For the subway and exhibition noises, however, MLPC gives 68.30% and 66.05% average word accuracy, respectively, whereas MFCC gives 64.28% and 64.09%, respectively.

From the above discussion we can conclude that MLPC is more effective than MFCC for the subway and exhibition noise types; on the other hand, MFCC is more suitable for the babble and car noises.

Table II: Word Accuracy (%) for MLPC front-end.


Noise SNR [dB] Average
Clean 20 15 10 5 0 -5 (20 to 0 dB)
Subway 98.71 96.93 93.43 78.78 49.55 22.81 11.08 68.30
Babble 98.61 89.96 73.76 47.82 21.95 6.80 4.44 48.06
Car 98.54 95.26 83.03 54.25 24.04 12.23 8.77 53.77
Exhibition 98.89 96.39 92.72 76.58 44.65 19.90 11.94 66.05
Average 98.69 94.64 85.74 64.36 35.05 15.44 9.06 59.05

Table III: Word Accuracy (%) for MFCC front-end.


Noise SNR [dB] Average
Clean 20 15 10 5 0 -5 (20 to 0 dB)
Subway 98.83 95.61 90.08 72.74 44.03 18.94 9.70 64.28
Babble 98.91 91.99 76.42 52.00 26.39 12.55 8.77 51.87
Car 98.78 95.71 85.12 61.59 30.51 9.99 7.01 56.59
Exhibition 98.95 95.87 90.31 74.54 42.61 17.09 8.73 64.09
Average 98.87 94.80 85.49 65.22 35.89 14.65 8.56 59.21
REFERENCES
[1] S. Davis and P. Mermelstein, “Comparison of
parametric representations for monosyllabic word
recognition in continuously spoken sentences,”
IEEE Trans. on Acoustics, Speech, and Signal
Processing, Vol. ASSP-28, No. 4, pp. 357-366,
1980.
[2] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990.
[3] N. Virag, “Speech enhancement based on masking
properties of the auditory system”, Proc.
ICASSP’95, pp.796-799, 1995.
[4] F. Itakura and S. Saito, “Analysis synthesis
telephony based upon the maximum likelihood
method”, Proc. of 6th International Congress on
Acoustics, Tokyo, p.C-5-5, 1968.
[5] B. Atal and M. Schroeder, “Predictive coding of
speech signals”, Proc. of 6th International Congress
on Acoustics, Tokyo, pp. 21-28, 1968.
[6] J. Makhoul and L. Cosell, “LPCW: An LPC vocoder with linear predictive spectral warping”, Proc. of ICASSP ’76, pp. 466-469, 1976.
[7] S. Itahashi and S. Yokoyama, “A formant
extraction method utilizing mel scale and equal
loudness contour”, Speech Transmission Lab.-
Quarterly Progress and Status Report (Stockholm)
(4), pp. 17-29, 1987.
[8] M. G. Rahim and B. H. Juang, “Signal bias removal by maximum likelihood estimation for robust telephone speech recognition,” IEEE Trans. on Speech and Audio Processing, Vol. 4, No. 1, pp. 19-30, 1996.
[9] H. W. Strube, “Linear prediction on a warped frequency scale”, J. Acoust. Soc. Am., vol. 68, no. 4, pp. 1071-1076, 1980.
[10] A. V. Oppenheim and D. H. Johnson, “Discrete
representation of signals,” IEEE Proc., vol. 60, no.
6, pp. 681-691, 1972.
[11] H. Matsumoto, Y. Nakatoh and Y. Furuhata, “An
efficient Mel-LPC analysis method for speech
recognition”, Proc. ICSLP ’98, pp. 1051-1054,
1998.
[12] S. Nakagawa, et al., eds., “Spoken language systems,” Ohmsha, Ltd., Japan, ch. 7, 2005.
[13] J. Markel and A. Gray, “Linear prediction of
speech”, Springer-Verlag, 1976.
[14] H. G. Hirsch and D. Pearce, “The AURORA
experimental framework for the performance
evaluation of speech recognition systems under
noisy conditions,” ISCA ITRW ASR 2000,
September 2000.
