
190

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007

Significance of the Modified Group Delay Feature in Speech Recognition


Rajesh M. Hegde, Hema A. Murthy, Member, IEEE, and Venkata Ramana Rao Gadde
Abstract: Spectral representation of speech is complete when both the Fourier transform magnitude and phase spectra are specified. In conventional speech recognition systems, features are generally derived from the short-time magnitude spectrum. Although the importance of the Fourier transform phase in speech perception has been realized, few attempts have been made to extract features from it. This is primarily because the resonances of the speech signal, which manifest as transitions in the phase spectrum, are completely masked by the wrapping of the phase spectrum. Hence, an alternative to processing the Fourier transform phase for extracting speech features is to process the group delay function, which can be computed directly from the speech signal. The group delay function has been used in earlier efforts to extract pitch and formant information from the speech signal. In all these efforts, no attempt was made to extract features from the speech signal and use them for speech recognition applications. This is primarily because the group delay function fails to capture the short-time spectral structure of speech owing to zeros that are close to the unit circle in the z-plane and also due to pitch periodicity effects. In this paper, the group delay function is modified to overcome these effects. Cepstral features are extracted from the modified group delay function and are called the modified group delay feature (MODGDF). The MODGDF is used for three speech recognition tasks, namely, speaker, language, and continuous-speech recognition. Based on the results of feature and performance evaluation, the significance of the MODGDF as a new feature for speech recognition is discussed.

Index Terms: Class separability, feature extraction, feature selection, Gaussian mixture models (GMMs), group delay function, hidden Markov models (HMMs), phase spectrum, robustness.

I. INTRODUCTION

SEVERAL techniques have been used to extract features from speech [1]. With the advent of Markov models in speech recognition, spectral features that are perceptually meaningful and invariant to the ambient acoustic environment have become increasingly common. Stevens and Volkman [2] developed the Mel scale as a result of a study of human auditory perception. The Mel scale was used by Mermelstein and Davis [3] to extract features from the speech signal for improved recognition performance. Spectral features are generally computed from the short-time Fourier transform power spectrum. Short-time Fourier analysis can be used to process speech under the assumption that it is quasi-stationary. Let x(n) be a given speech sequence and X(ω) its short-time Fourier transform (STFT), computed after applying a window w(n) on the speech signal [4]:

X(ω) = Σ_n x(n) w(n) e^(−jωn)    (1)

The STFT can also be expressed as

X(ω) = |X(ω)| e^(jθ(ω))    (2)

In (1), the window w(n) selects the short time over which the Fourier transform is evaluated. In (2), |X(ω)| corresponds to the short-time magnitude spectrum and θ(ω) corresponds to the phase spectrum. The square of the magnitude spectrum, |X(ω)|², is called the short-time power spectrum. The speech signal is, therefore, completely characterized by both the short-time magnitude and phase spectra. However, most spectral features are derived from the STFT magnitude spectrum, while the short-time phase spectrum is not used. This is primarily because of the complex issues and nonuniqueness associated with unwrapping the phase spectrum. In [5], the phase spectrum has been used for improved speech recognition performance. The phase spectrum has also been used for various speech processing tasks in [6] via the group delay domain. Previous work on the use of group delay and modified group delay functions concentrates on spectrum estimation [7], signal reconstruction [8], and extraction of source and system information [6] from the speech signal. In this paper, the focus is on extracting features from the speech signal using the modified group delay function [9]. In Section II, the theory, properties, and importance of the group delay function in speech processing are discussed. The basis and the need for modifying the group delay function are also discussed in this section. The modified group delay function [9]-[11] and related issues are discussed in Section III. In Section IV, the modified group delay function is converted to cepstral coefficients using the discrete cosine transform [12]. The feature thus extracted is called the modified group delay feature (MODGDF). In Section V, the MODGDF is analyzed for robustness and for the cumulative separability of different feature dimensions.
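The decomposition in (1) and (2) can be sketched directly with an FFT. The following is an illustrative example, not the paper's implementation; the window length, sampling rate, and test tone are arbitrary choices of ours.

```python
import numpy as np

def stft_frame(x, w):
    """Short-time spectrum of one windowed frame, as in (1)."""
    X = np.fft.rfft(x * w)
    # Magnitude and (wrapped) phase spectra, as in (2).
    return np.abs(X), np.angle(X)

# Illustrative frame: a 100-Hz tone sampled at 8 kHz, Hamming-windowed.
fs = 8000
n = np.arange(400)
x = np.sin(2 * np.pi * 100 * n / fs)
mag, phase = stft_frame(x, np.hamming(len(x)))
k_peak = np.argmax(mag)          # bin with the largest magnitude
print(k_peak * fs / len(x))      # 100.0 (Hz)
```

Note that the phase returned here is wrapped to (−π, π]; the masking of formant transitions by exactly this wrapping is the motivation for the group delay representation developed in Section II.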

Manuscript received April 6, 2005; revised September 26, 2005. The work of R. M. Hegde was supported by the National Science Foundation under Awards 0331707 and 0331690. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Geoffrey Zweig. R. M. Hegde was with the Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-36, India. He is now with the California Institute of Telecommunication and Information Technology, University of California at San Diego, La Jolla, CA 92093 USA (e-mail: rhegde@ucsd.edu). H. A. Murthy is with the Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai-36, India (e-mail: hema@lantana.tenet.res.in). V. R. R. Gadde is with the Speech Technology and Research (STAR) Laboratory, SRI International, Menlo Park, CA 94025 USA (e-mail: rao@speech.sri.com). Digital Object Identifier 10.1109/TASL.2006.876858

1558-7916/$20.00 © 2006 IEEE


In Section VI, the performance of the MODGDF is evaluated for three speech recognition tasks, namely, speaker, language, and syllable recognition. The extraction of the MODGDF and other conventional features used in these tasks is discussed in Section VI-B2. The computation of the MODGDF involves three free parameters: lifterw (the length of the window used in the cepstral domain), α, and γ. The choice of these three parameters is crucial for optimal recognition performance across all databases and unseen data. The estimation of optimal values for these three free parameters from a signal processing perspective and also using line search is discussed in Sections VI-C and VI-F, respectively. The feature is evaluated for the speaker identification task [13]-[15] on the TIMIT (clean speech) [16] and the NTIMIT (noisy telephone speech) [17], [18] databases, and the results are listed in Section VI-G. The results of language identification experiments conducted on the DBIL database [19] for the three-language task and the OGI_MLTS database [20] for the 11-language task using the MODGDF as the front end are presented in Section VI-H. Experimental results for syllable recognition using the MODGDF on two Indian languages [19] are described in Section VI-I. We conclude with a discussion on the importance of the MODGDF in speech processing, its potential applications, and future issues in Section VII.

II. THEORY AND PROPERTIES OF GROUP DELAY FUNCTIONS

The characteristics of the speech signal are perceived visually better in the short-time magnitude spectrum than in the short-time phase spectrum. With specific reference to the speech signal, it can be stated that the resonances of the speech signal present themselves as the peaks of the envelope of the short-time magnitude spectrum. These resonances, often called formants, manifest as transitions in the short-time phase spectrum. The problem with identifying these transitions is that they are masked by the wrapping of the short-time phase spectrum at multiples of 2π. Hence, any meaningful use of the short-time phase spectrum for speech processing involves the nonunique process of phase unwrapping. The group delay function, which is defined as the negative derivative of the unwrapped short-time phase spectrum, can be computed directly from the speech signal as in [21] without unwrapping the short-time phase spectrum. The group delay function has been effectively used to extract various source and system parameters [22] when the signal under consideration is a minimum phase signal. This is primarily because the magnitude spectrum of a minimum phase signal [22] and its group delay function resemble each other [23].

A. Group Delay Function

Group delay is defined as the negative derivative of the Fourier transform phase. Mathematically, the group delay function is defined as

τ(ω) = −dθ(ω)/dω    (3)

where the phase spectrum θ(ω) of a signal is defined as a continuous function of ω. The deviation of the group delay function away from a constant indicates the degree of nonlinearity of the phase. The Fourier transform phase and the Fourier transform magnitude are related as in [8]. The group delay function can also be computed from the signal x(n) as in [9] using

τ(ω) = −Im( d(log X(ω)) / dω )    (4)

τ(ω) = ( X_R(ω) Y_R(ω) + X_I(ω) Y_I(ω) ) / |X(ω)|²    (5)

where the subscripts R and I denote the real and imaginary parts of the Fourier transform, and X(ω) and Y(ω) are the Fourier transforms of x(n) and n·x(n), respectively. It is also important to note in this context that the group delay function can be expressed in terms of the cepstral coefficients c(n) as

τ(ω) = Σ_n n c(n) cos(nω)    (6)

where c(n) are the cepstral coefficients. Hence, in general, the group delay function can also be viewed as the Fourier transform of the weighted cepstrum.

B. Group Delay Spectrum and Magnitude Spectrum

In general, if we consider the spectrum of any signal as a cascade of resonators, the frequency response of the overall filter is given by [23]

H(ω) = Π_{k=1}^{K} 1 / ( (jω − p_k)(jω − p_k*) )    (7)

where (p_k, p_k*) is the complex pair of poles of the kth resonator, with p_k = −b_k + jω_k. The squared magnitude spectrum is given by

|H(ω)|² = Π_{k=1}^{K} 1 / ( (b_k² + (ω − ω_k)²)(b_k² + (ω + ω_k)²) )    (8)

and the phase spectrum is given by

θ(ω) = −Σ_{k=1}^{K} [ arctan((ω − ω_k)/b_k) + arctan((ω + ω_k)/b_k) ]    (9)

It is well known that the magnitude spectrum of an individual resonator has a peak at ω_k and a half-power bandwidth of 2b_k. The group delay function can be derived using (9) and is given by

τ(ω) = Σ_{k=1}^{K} [ b_k / (b_k² + (ω − ω_k)²) + b_k / (b_k² + (ω + ω_k)²) ]    (10)

It was shown in [23] that at the resonance frequency ω_k, the group delay function behaves like a squared magnitude response.
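Equation (5) lends itself to a direct FFT-based implementation. The sketch below is illustrative (the function name and test signals are ours, not the paper's): it checks the computed group delay of a single-pole minimum phase sequence against the closed-form value a/(1 − a) at ω = 0, and previews the effect, discussed later in Section II-E, of a zero approaching the unit circle.

```python
import numpy as np

def group_delay(x, nfft=512):
    """tau(w) = (X_R*Y_R + X_I*Y_I) / |X(w)|^2, as in (5)."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(n * x, nfft)       # DFT of n * x(n)
    return (X.real * Y.real + X.imag * Y.imag) / np.abs(X) ** 2

# Single real pole at z = 0.5: closed-form group delay at w = 0 is a/(1-a) = 1.
tau_pole = group_delay(0.5 ** np.arange(64))
print(round(tau_pole[0], 6))   # 1.0

# A conjugate zero pair almost on the unit circle produces a large spike,
# while the same pair pushed radially inward (radius 0.8) does not.
angle = 2 * np.pi * 64 / 512   # chosen to fall exactly on an FFT bin
near = group_delay(np.array([1.0, -2 * 0.999 * np.cos(angle), 0.999 ** 2]))
inside = group_delay(np.array([1.0, -2 * 0.8 * np.cos(angle), 0.8 ** 2]))
print(np.abs(near).max() > 50 * np.abs(inside).max())   # True
```

The spike in the near-circle case is on the order of r/(1 − r), which is exactly the behavior that motivates the modification in Section III.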

192

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007

C. Restriction of Minimum Phase

The group delay function can be effectively used for various speech processing tasks only when the signal under consideration is a minimum phase signal. A signal x(n) is defined as a minimum phase signal if both x(n) and its inverse are energy bounded and one-sided signals. Alternately, as viewed in the z-domain, x(n) is a minimum phase signal if and only if all the poles and zeros of the z-transform of x(n) lie within the unit circle. Mathematically, a minimum phase system is defined as

X(z) = K Π_{k=1}^{M} (1 − a_k z⁻¹) / Π_{k=1}^{N} (1 − b_k z⁻¹)    (11)

where |a_k| < 1 and |b_k| < 1. The phase and magnitude spectra of a minimum phase signal are related through the Hilbert transform [21]. An analysis of the group delay functions and their relevance for different types of systems, like minimum phase and maximum phase, can be found in [8].

D. Properties of Group Delay Functions

The group delay functions and their properties have been discussed in [22] and [24]. The two main properties of the group delay functions [24] of relevance to this work are the additive property and the high-resolution property.

1) Additive Property: The group delay function exhibits an additive property. Let

X(ω) = X₁(ω) X₂(ω)    (12)

where X₁(ω) and X₂(ω) are the responses of the two resonators whose product gives the overall system response. Taking the absolute value on both sides, we have

|X(ω)| = |X₁(ω)| |X₂(ω)|    (13)

Using the additive property of the Fourier transform phase

θ(ω) = θ₁(ω) + θ₂(ω)    (14)

Then, the group delay function is given by

τ(ω) = τ₁(ω) + τ₂(ω)    (15)

where τ₁(ω) and τ₂(ω) correspond to the group delay functions of X₁(ω) and X₂(ω), respectively. From (12) and (15), it is clear that multiplication in the spectral domain becomes an addition in the group delay domain. The additive property of the group delay functions is also discussed in [24].

2) High-Resolution Property: The group delay function has a higher resolving power when compared to the magnitude spectrum. The ability of the group delay function to resolve closely spaced formants in the speech spectrum has been investigated in [24]. An illustration is given in Fig. 1 to highlight the high-resolution property of the group delay function over both the magnitude and linear prediction spectra.

Fig. 1. Comparison of the minimum phase group delay function with the magnitude and linear prediction (LP) spectrum. (a) The z-plane with three poles inside the unit circle. (b) The magnitude spectrum of the system shown in (a). (c) The LPC spectrum of the system shown in (a). (d) The group delay spectrum of the system shown in (a).

Fig. 1(a) shows the z-plane plot of a system consisting of three complex conjugate pole pairs. Fig. 1(b) is the corresponding magnitude spectrum, Fig. 1(c) illustrates the spectrum derived using LPC analysis, and Fig. 1(d) is the corresponding group delay spectrum. It can be clearly observed that the three formants are resolved better in the group delay spectrum when compared to the magnitude or linear prediction spectrum. From these results, it is also evident that the system information in the speech signal is captured relatively better by the group delay spectrum.

E. Basis for Modifying the Group Delay Function

It has been shown in [25] that group delay functions can be used to accurately represent signal information as long as the roots of the z-transform of the signal are not too close to the unit circle in the z-plane. It is also true that the vocal tract system and the excitation contribute to the envelope and the fine structure, respectively, of the speech spectrum. When the Fourier transform magnitude spectrum is used to extract speech features, the focus is on capturing the spectral envelope and not the fine structure. Similarly, the fine structure has to be de-emphasized when extracting the vocal tract characteristics from the group delay function. The zeros that are close to the unit circle manifest as spikes in the group delay function, and the strength of these spikes is proportional to the proximity of these zeros to the unit circle. To illustrate this, a four-formant system with four poles and their complex conjugates is simulated. The pole-zero plot of the four-formant system is shown in Fig. 2(a), while the corresponding group delay spectrum is shown in Fig. 2(b). Fig. 2(c) shows the pole-zero plot of the same system with zeros added uniformly in very close proximity to the unit circle. It is evident from Fig. 2(d) that the group delay spectrum for such a system


becomes very spiky and ill defined, primarily due to the zeros that are added in very close proximity to the unit circle in the z-plane. In Fig. 2(e), we manually move all the zeros radially into the unit circle and recompute the group delay function of such a system. The group delay spectrum of such a system is shown in Fig. 2(f). It is clear that this technique of pushing the zeros radially into the unit circle restores the group delay spectrum without any distortion of the original formant locations.

Fig. 2. Significance of proximity of zeros to the unit circle. (a) The z-plane with four poles inside the unit circle. (b) The group delay spectrum of the system shown in (a). (c) The z-plane with four poles inside the unit circle and zeros added uniformly on the unit circle. (d) The group delay spectrum of the system shown in (c). (e) The z-plane with zeros pushed radially inward into the unit circle. (f) The group delay spectrum of the system shown in (e).

The spikes introduced by zeros close to the unit circle form a significant part of the fine structure and cannot be eliminated by normal smoothing techniques. Hence, the group delay function has to be modified to eliminate the effects of these spikes. The considerations discussed so far in this section form the basis for modifying the group delay function.

III. MODIFIED GROUP DELAY FUNCTION

As mentioned in Section II-E, for the group delay function to be a meaningful representation, it is only necessary that the roots of the transfer function are not too close to the unit circle in the z-plane. Normally, in the context of speech, the poles of the transfer function are well within the unit circle. The zeros of the slowly varying envelope of speech correspond to those of nasals. The zeros in speech are either within or outside the unit circle, since the zeros also have nonzero bandwidth. In this section, we modify the computation of the group delay function to suppress these effects. A similar approach was taken in an earlier paper by one of the authors [7] for spectrum estimation. Let us reconsider the group delay function derived directly from the speech signal. It is important to note that the denominator term |X(ω)|² in (5) becomes zero at zeros that are located close to the unit circle. The spiky nature of the group delay spectrum can be overcome by replacing the term |X(ω)|² in the denominator of the group delay function in (5) with its cepstrally smoothed version, S(ω)².

A. Significance of Cepstral Smoothing

Assuming a source-system model of speech production, the z-transform of the system generating the speech signal is given by

X(z) = N(z) / D(z)    (16)

where the polynomial N(z) is the contribution due to the zeros and the polynomial D(z) is the contribution due to the poles of the vocal tract system. The frequency response of X(z) is given by

X(ω) = N(ω) / D(ω)    (17)

where N(ω) and D(ω) are obtained by evaluating the polynomials on the unit circle in the z-domain. By using the additive property of the group delay function, the group delay function of the system characterized by (17) is given by

τ_x(ω) = τ_N(ω) − τ_D(ω)    (18)

where τ_N(ω) and τ_D(ω) are the group delay functions of N(ω) and D(ω), respectively. Spikes of large amplitude are introduced into τ_N(ω), primarily due to zeros of N(z) close to the unit circle. As already discussed, the group delay function can be directly computed from the speech signal as

τ_x(ω) = ( X_R(ω) Y_R(ω) + X_I(ω) Y_I(ω) ) / |X(ω)|²    (19)

The group delay function for N(ω) in (18) can be written as

τ_N(ω) = η_N(ω) / |N(ω)|²    (20)

where η_N(ω) is the numerator term of (19) for N(ω). As |N(ω)| tends to zero (for zeros on the unit circle), τ_N(ω) has large amplitude spikes. Similarly, the group delay function for D(ω) in (18) can be written as

τ_D(ω) = η_D(ω) / |D(ω)|²    (21)

where η_D(ω) is the numerator term of (19) for D(ω). The term |D(ω)|² does not take values very close to zero, since D(z) has all its roots well within the unit circle. Therefore, the term τ_D(ω) contains the information about the poles of the system and has no spikes of large amplitude. Substituting (20) and (21) in (18), we have

τ_x(ω) = η_N(ω) / |N(ω)|² − η_D(ω) / |D(ω)|²    (22)


where η_N(ω) and η_D(ω) are the numerator terms of (19) for N(ω) and D(ω), respectively. Assuming that the envelope of |N(ω)| is nearly flat (zero spectrum), multiplying τ_x(ω) with |N(ω)|² will emphasize the resonant peaks of the second term:

τ_x(ω) |N(ω)|² = η_N(ω) − ( |N(ω)|² / |D(ω)|² ) η_D(ω)    (23)

This leads to the initial form of the modified group delay function, which is given by

τ_m(ω) = τ_x(ω) |N(ω)|²    (24)

Substituting (22) in (24)

τ_m(ω) = η_N(ω) − ( |N(ω)|² / |D(ω)|² ) η_D(ω)    (25)

In (25), an approximation to |N(ω)|² is required, which is a nearly flat spectrum (ideally a zero spectrum). An approximation to |N(ω)|² can be computed as

Fig. 3. Comparison of various spectra for a synthetic signal. (a) The synthetic signal with two resonances. (b) The log magnitude spectrum of the signal shown in (a). (c) The root magnitude spectrum (root = 2/3) of the signal shown in (a). (d) The group delay spectrum of the signal shown in (a). (e) The modified group delay spectrum of the signal shown in (a).

|N(ω)|² ≈ |X(ω)|² / S(ω)²    (26)

where |X(ω)|² is the squared magnitude of the signal and S(ω)² is the cepstrally smoothed spectrum of |X(ω)|² [7], [26]. Alternately, the modified group delay function can be defined as

τ_m(ω) = ( X_R(ω) Y_R(ω) + X_I(ω) Y_I(ω) ) / S(ω)²    (27)

Therefore, the modified group delay function is capable of pushing zeros on the unit circle radially into the unit circle, thus emphasizing τ_D(ω), which corresponds to the contribution from the poles of the vocal tract system.

B. Definition of the Modified Group Delay Function

Since the peaks at the formant locations are very spiky in nature, two new parameters α and γ are introduced to reduce the amplitude of these spikes and to restore the dynamic range of the speech spectrum. The new modified group delay function is defined as

τ_m(ω) = ( τ_c(ω) / |τ_c(ω)| ) |τ_c(ω)|^α    (28)

where

τ_c(ω) = ( X_R(ω) Y_R(ω) + X_I(ω) Y_I(ω) ) / S(ω)^(2γ)    (29)

Here, S(ω) is the cepstrally smoothed version of |X(ω)|. The parameters α and γ vary from 0 to 1, where 0 < α ≤ 1 and 0 < γ ≤ 1.

Fig. 3(a) shows a synthetic signal with two resonances. Fig. 3(b) and (c) show the log magnitude spectrum and the root magnitude spectrum (root = 2/3), respectively. The group delay and the modified group delay spectra are shown in Fig. 3(d) and (e), respectively. The resonant frequencies are clearly visible in the log magnitude and root magnitude spectra, while the group delay function does not show any structure. This is primarily because the synthetic signal is nonminimum phase. Clearly, what is required is a modification to the group delay function that will yield spectra similar to that of the minimum phase group delay function [24]. This is what is achieved in the modified group delay function, as illustrated in Fig. 3(e).

IV. PARAMETERIZING THE MODIFIED GROUP DELAY FUNCTION

Since the modified group delay function exhibits a squared magnitude behavior at the location of the roots, we refer to the modified group delay function as the modified group delay spectrum henceforth. Homomorphic processing is the most commonly used approach to convert spectra derived from the speech signal into meaningful features. This is primarily because this approach yields features that are linearly decorrelated, which allows the use of diagonal covariances in modeling the speech vector distribution. In this context, the discrete cosine transform (DCT I, II, III) [12] is the most commonly used transformation for converting the modified group delay spectrum to cepstral features. Hence, the group delay function is converted to cepstra using the discrete cosine transform (DCT II) as

c(n) = Σ_{k=0}^{N−1} τ_m(k) cos( πn(2k + 1) / (2N) )    (30)

where N is the discrete Fourier transform (DFT) order and τ_m(k) is the modified group delay spectrum. The discrete cosine transform can also be used in the reconstruction of the modified group delay spectrum from the modified group delay cepstra. Velocity and acceleration parameters for the new group delay function are defined in the cepstral domain, in a manner similar to that of the velocity and acceleration parameters for MFCC.
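The chain (26)-(30) can be sketched end to end as follows. This is a minimal illustration, not the authors' implementation: the exact form of the cepstral lifter, the synthetic frame, and the parameter values (lifterw, α, γ, number of retained coefficients) are our assumptions.

```python
import numpy as np

def modgdf(frame, nfft=512, lifterw=8, alpha=0.4, gamma=0.9, n_coeff=13):
    """Sketch of the MODGDF computation, following (26)-(30)."""
    n = np.arange(len(frame))
    X = np.fft.fft(frame, nfft)
    Y = np.fft.fft(n * frame, nfft)   # DFT of n * x(n)

    # Cepstrally smoothed spectrum S(w), in the spirit of (26): keep only
    # the low-quefrency part of the cepstrum of the log magnitude spectrum.
    log_mag = np.log(np.abs(X) + 1e-10)
    cep = np.fft.ifft(log_mag).real
    lifter = np.zeros(nfft)
    lifter[:lifterw] = 1.0
    lifter[-lifterw + 1:] = 1.0       # symmetric low-quefrency lifter
    S = np.exp(np.fft.fft(cep * lifter).real)

    # Modified group delay, as in (28)-(29).
    tau_c = (X.real * Y.real + X.imag * Y.imag) / S ** (2.0 * gamma)
    tau_m = np.sign(tau_c) * np.abs(tau_c) ** alpha

    # DCT-II to cepstra, as in (30); c(0) is retained (see Section IV-A).
    k = np.arange(nfft)
    return np.array([np.sum(tau_m * np.cos(np.pi * q * (2 * k + 1) / (2 * nfft)))
                     for q in range(n_coeff)])

# One voiced-like frame: a strong decaying resonance plus a weaker one.
t = np.arange(400)
frame = ((0.99 ** t) * np.sin(2 * np.pi * 0.1 * t)
         + 0.3 * (0.98 ** t) * np.sin(2 * np.pi * 0.25 * t))
feat = modgdf(frame)
print(feat.shape)   # (13,)
```

In practice these 13 coefficients would be augmented with velocity and acceleration parameters computed in the cepstral domain, exactly as for MFCC.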


A. Importance of c(0)

In the form of the modified group delay cepstrum defined in (30), the first coefficient, c(0), is generally ignored [see (6)]. This value corresponds to the average value of the group delay function. Owing to the effects of linear phase due to the window and the location of pitch peaks with respect to the window, it is really not clear how important the value of c(0) is for recognition. Nevertheless, if we ignore the effects of the window and pitch peaks, the group delay must also contain additional information, in terms of the delays in sources corresponding to those of the formants. This will result in an average value different from zero in the group delay domain. So, it might not be appropriate to ignore the first coefficient in the inverse DCT. We, therefore, define the MODGDF in a form where the first coefficient c(0) is retained rather than discarded. This is primarily because computation of the MODGDF from the modified group delay spectrum essentially yields weighted cepstral coefficients as in (6). The relation in (6) is also discussed in [7].

V. FEATURE EVALUATION OF THE MODGDF

In this section, we evaluate the MODGDF against several feature evaluation criteria and discuss their significance in automatic speech, speaker, and language recognition tasks.

A. Robustness

Features that are invariant to noise save additional processing like cepstral mean subtraction and eliminate sources of distortion. Representation of speech in the group delay domain enhances important features of the envelope of the short-time speech spectrum, making it relatively immune to noise when compared to the short-time magnitude spectrum.

1) Robustness to Convolutional and White Noise: Assuming a source-system model of speech production, the clean speech x(n), its Fourier transform X(ω), and the corresponding group delay function τ_x(ω) [9] are given by (31) and (32). Similarly, the noisy speech signal y(n) and its Fourier transform are given by

y(n) = x(n) * h(n) + w(n)    (33)

Y(ω) = X(ω) H(ω) + W(ω)    (34)

where h(n) is the time-invariant channel response and w(n) is the additive white noise. Taking the Fourier transform of (31) and substituting in (34), the corresponding group delay function τ_y(ω) is given by (35) and (36), where one term of τ_y(ω) corresponds to X(ω)H(ω) and the other to W(ω). Further, the term corresponding to X(ω)H(ω) dominates in high signal-to-noise ratio (SNR) regions, and the term corresponding to W(ω) dominates in low SNR regions. Since the smoothing in the modified group delay function is chosen to preserve only the spectral envelope, the question of noise being emphasized does not arise. In the high-SNR case, it is the excitation, and in the low-SNR case, it is white noise that makes the group delay spectrum spiky and distorted, primarily due to zeros that are very close to the unit circle in the z-domain. White noise has a flat spectral envelope and, hence, contributes zeros very close to the unit circle. Further, the locations and amplitudes of these spikes are also not known. To suppress these spikes, the behavior of the spectrum where the noise zeros contribute to sharp nulls is utilized. A spectrum with a near-flat spectral envelope containing the spectral shape contributed by the zeros is derived using cepstral smoothing, as discussed in Section III-A, and multiplied with the group delay function to get the modified group delay function as in (44) and (45). The effects due to the excitation can be dealt with by pushing all zeros very close to the unit circle in the z-domain well inside the unit circle by appropriately selecting values for the two parameters α and γ as defined in (44) and (45).

2) Comparison to Log and Root Compressed Cepstra: It is known that log-cepstral analysis is sensitive to noise and that the root compressed cepstral approaches [27], [28] represent speech better in noise. In this section, we compare the log and root compression approaches with the MODGDF in the presence of white noise at different values of SNR. We pick 20 complete sentences from different dialect regions, consisting of both female and male speakers, from the TIMIT database. These sentences are corrupted with white noise scaled by a factor; the value of this factor is varied and the SNR computed. The average error distributions between the clean and the noisy speech across all frames corresponding to the 20 sentences are then calculated for four different values of SNR: 0, 3, 6, and 10 dB. In Fig. 4, we compare the average error distributions of the MODGDF (α = 0.4, γ = 0.9), the spectral root compressed cepstra [27] (root = 2/3), the energy root compressed cepstra [28] (root = 0.08), and the log compressed cepstra (MFCC).1 Fig. 4(a)-(d) correspond to the average error distribution of the MODGDF computed for an SNR of 0, 3, 6, and 10 dB, respectively, while Fig. 4(e)-(h) correspond to that of the spectral root compressed cepstra (SRC), Fig. 4(i)-(l) to that of the energy root compressed cepstra (ERC), and Fig. 4(m)-(p) to that of the log compressed cepstra (MFCC), each at an SNR of 0, 3, 6, and 10 dB, respectively. It is clear from Fig. 4 that the average deviation of the noisy speech cepstra from the clean speech cepstra is the least for the MODGDF when compared to either the spectral root, the energy root, or the log compressed cepstra.

B. Significance of Removing Channel Effects in the Group Delay Domain

Owing to the nonlinearities that are introduced by α and γ, the removal of the channel effects is an issue.
1The free parameters were optimized using line search for the different cepstra.
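The noise-addition protocol described above (white noise scaled so the mixture reaches a target SNR) can be sketched as follows. The function name and the closed-form scale factor are ours; the paper does not give its implementation.

```python
import numpy as np

def add_white_noise(x, snr_db, rng=np.random.default_rng(0)):
    """Scale white noise so that the mixture has the requested SNR in dB."""
    noise = rng.standard_normal(len(x))
    # Choose the scale so that 10*log10(P_signal / P_noise) = snr_db.
    scale = np.sqrt(np.mean(x ** 2) /
                    (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return x + scale * noise

x = np.sin(2 * np.pi * 0.05 * np.arange(8000))
y = add_white_noise(x, snr_db=6.0)
measured = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(round(measured, 1))   # 6.0
```

Because the scale is computed from the actual noise realization, the measured SNR matches the target exactly rather than only in expectation.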


It is clear, therefore, that if the channel effects are multiplicative, they become additive in the phase and, hence, in the group delay domain, provided that α and γ are each equal to 1. Generally, in the case of MFCC, the mean removal is done in the cepstral domain. In the case of the modified group delay function, owing to the artifacts introduced by α and γ, it is not clear in which domain the mean removal must be performed. Hence, two different approaches have been tried: ignore the cross terms and assume that the channel effects are additive in the cepstral domain; or perform mean removal in the group delay domain with α and γ set to one. Although the second approach is theoretically correct, the performance of the system (see Section VI) using the first approach seems to be far superior. This could be due to the fact that the signal is not only corrupted by multiplicative channel effects but also by additive noise. Using the argument in [7], it is important to suppress the effects of noise in the modified group delay function before it can be further processed. In the context of the second approach, the other issue would be whether the mean removal should be performed on the envelope of the modified group delay function or on the standard modified group delay function. This is similar to converting the raw Fourier spectrum into filter bank energies in the computation of MFCC. To enable this, a new parameter (see Section VI-B1) is introduced. This parameter defines the fineness of the envelope of the modified group delay function for mean computation.

Fig. 4. Comparison of the average error distributions of the MODGDF, MFCC, and root compressed cepstra in noise. (a)-(d) Error distributions of the MODGDF (α = 0.4, γ = 0.9) at 0-, 3-, 6-, and 10-dB SNR. (e)-(h) Error distributions of the spectrally root compressed cepstra (root = 2/3) at 0-, 3-, 6-, and 10-dB SNR. (i)-(l) Error distributions of the energy root compressed cepstra (root = 0.08) at 0-, 3-, 6-, and 10-dB SNR. (m)-(p) Error distributions of the MFCC at 0-, 3-, 6-, and 10-dB SNR.

C. Separability Analysis in the High-Dimensional Feature Space

The most commonly used separability measures in speech recognition are geometrically intuitive measures like the F-ratio and mathematical measures like the Chernoff and Bhattacharya bounds [29]. The Bhattacharya bound, which is a special case of the Chernoff bound, is a probabilistic error measure and relates more closely to the likelihood maximization classifiers that we use for performance evaluation. The analysis presented herein measures class separability between speakers, in the context of speaker identification, and between languages, in the context of language identification. Following general practice in pattern recognition terminology, we refer to speakers and languages as classes in the analysis that follows. The Bhattacharya distance [29] is defined as

B = −ln ∫ √( p_i(x) p_j(x) ) dx    (37)

Assuming that the distributions are Gaussian, the probability density function for the ith class is given by

p_i(x) = (2π)^(−d/2) |Σ_i|^(−1/2) exp( −(1/2)(x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i) )    (38)

where μ_i is the mean vector and Σ_i is the covariance matrix of the ith class distribution. The multivariate integral in (37) can be evaluated and simplified to

(39) Assuming that the feature components are independent of each other and from (39) the distance between any two feature vectors and can be computed on a component pair basis. We can, between the component pairs therefore, dene the distance of the two feature vectors and as (40) Finally, the Bhattacharya distance between the two feature vectors and with number of component pairs is given by

$B = \sum_{k=1}^{d} B_k$  (41)

Further, the Bhattacharya distance for a two-class and a multiclass (M-class) case, as in [29], is given by

$J = B(1,2)$  (42)

$J = \sum_{i=1}^{M-1}\sum_{j=i+1}^{M} B(i,j)$  (43)
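For diagonal covariances, the component-pair distance and its sum over components can be sketched as below; this is a minimal numpy illustration of the standard Gaussian Bhattacharya distance, with function names of our own choosing:

```python
import numpy as np

def bhattacharyya_component(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two 1-D Gaussians (one component pair)."""
    return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
            + 0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2))))

def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Sum the component-pair distances for diagonal-covariance Gaussians."""
    mu1, var1 = np.asarray(mu1, float), np.asarray(var1, float)
    mu2, var2 = np.asarray(mu2, float), np.asarray(var2, float)
    return float(np.sum(bhattacharyya_component(mu1, var1, mu2, var2)))
```

Identical distributions give a distance of zero; the distance grows with both mean separation and variance mismatch, which is what makes it a usable class-separability criterion.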

HEGDE et al.: SIGNIFICANCE OF THE MODIFIED GROUP DELAY FEATURE

197

Fig. 5. Results of separability analysis. (a) Cumulative speaker separability of MODGDF and MFCC using Bhattacharya distance. (b) Cumulative language separability of MODGDF and MFCC using Bhattacharya distance.

where (42) gives the Bhattacharya distance between two classes and (43) gives the Bhattacharya distance between M classes. The MODGDF shows good separability on a pairwise basis as in [10] and, therefore, we use the Bhattacharya distance measure to investigate class separability criteria. We, therefore, consider 50 speakers from the NTIMIT [17] database and compute a 16-dimensional codebook of size 32 for each speaker. Similarly, we consider 11 languages from the OGI_MLTS [20] database and compute a 16-dimensional codebook of size 32 for each language. The cumulative separability criterion based on the Bhattacharya distance measure is then calculated. The cumulative speaker separability criterion versus feature dimension for the MODGDF and the MFCC is illustrated in Fig. 5(a). The cumulative language separability criterion versus feature dimension for both the MODGDF and the MFCC is illustrated in Fig. 5(b). From Fig. 5(a) and (b), it is clear that the MODGDF is relatively better than the MFCC with respect to class separability for both the speaker and the language tasks, as the cumulative separability curve corresponding to the MODGDF lies above that of the MFCC.

VI. PERFORMANCE EVALUATION

In this section, the MODGDF is used as a front end for building automatic speaker, language, and syllable recognition systems. The various databases used in the study are also briefly discussed. The procedures adopted to estimate the optimal values for $lifter_w$ (the length of the window used in the cepstral domain), $\alpha$, and $\gamma$ that give the best recognition performance across all three tasks are also described. The performance of the MODGDF is also compared with LFCC, log compressed MFCC, spectral root compressed MFCC as in [27], and energy root compressed MFCC as in [28].

A. Databases Used in the Study

Since the MODGDF has been used for the tasks of syllable, speaker, and language recognition, there are four databases used in

the study. The databases used are the Database for Indian Languages (DBIL) [19] for syllable recognition, TIMIT [16] and NTIMIT [17] for speaker identification, and OGI_MLTS [20] and DBIL [19] for language identification.

1) Database for Indian Languages (DBIL) [19]: DBIL Tamil database: This corpus consists of 20 news bulletins of the Tamil language transmitted by Doordarshan India, each of 15-min duration, comprising ten male and ten female speakers. The total number of distinct syllables is 2184. DBIL Telugu database: This corpus consists of 20 news bulletins of the Telugu language transmitted by Doordarshan India, each of 15-min duration, comprising ten male and ten female speakers. The total number of distinct syllables is 1896.

2) TIMIT Database [16]: The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus was a joint effort among the Massachusetts Institute of Technology (MIT), Stanford Research Institute (SRI), and Texas Instruments (TI). TIMIT contains a total of 6300 sentences: ten sentences spoken by each of 630 speakers from eight major dialect regions of the United States.

3) NTIMIT Database [17]: The NTIMIT corpus was developed by the NYNEX Science and Technology Speech Communication Group to provide a telephone-bandwidth adjunct to the popular TIMIT Acoustic-Phonetic Continuous Speech Corpus. NTIMIT was collected by transmitting all 6300 original TIMIT utterances through various channels in the NYNEX telephone network and redigitizing them. The actual telephone channels used were varied in a controlled manner, in order to sample various line conditions. The NTIMIT utterances were time-aligned with the original TIMIT utterances so that the TIMIT time-aligned transcriptions can be used with the NTIMIT corpus as well.

4) OGI_MLTS Database [20]: The OGI Multi-language Telephone Speech Corpus consists of telephone speech from 11 languages. The initial collection included 900 calls, 90 calls each in ten languages, and was collected by Muthusamy [20].
The languages are English, Farsi, French, German, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. It is from this initial set that the training (50), development (20), and test (20) sets were established. The National Institute of Standards and Technology (NIST) uses the same 50-20-20 split, and the corpus is used by NIST for the evaluation of automatic language identification.

B. Computation of Various Features

In this section, the computation of the MODGDF and of the other features, i.e., MFCC, LFCC, spectral root compressed MFCC, and energy root compressed MFCC, is discussed.

1) Algorithm for Computing the Modified Group Delay Cepstra: The following is the algorithm for computing the modified group delay cepstra. Preemphasize the speech signal $x(n)$, followed by frame blocking at a frame size of 20 ms and a frame shift of 10 ms. A Hamming window is applied on each frame of the speech signal.


Compute the DFT of the framed and windowed speech signal $x(n)$ as $X(k)$ and of the time-scaled speech signal $n\,x(n)$ as $Y(k)$. Compute the cepstrally smoothed spectrum of $|X(k)|^2$; let this be $S(k)$. A low-order cepstral window $lifter_w$ that essentially captures the dynamic range of $|X(k)|^2$ should be chosen (see Sections VI-C and VI-F). Compute the modified group delay function as

$\tau_m(k) = \frac{\tau(k)}{|\tau(k)|}\,|\tau(k)|^{\alpha}$  (44)

where

$\tau(k) = \frac{X_R(k)Y_R(k) + Y_I(k)X_I(k)}{S(k)^{2\gamma}}$  (45)

The new parameters $\alpha$ and $\gamma$ introduced here vary from 0 to 1. Set the values $\alpha = 0.4$ and $\gamma = 0.9$ (see Sections VI-C and VI-F for the estimation of these values). Other values of $\alpha$ and $\gamma$ can also be determined for a particular environment by using line search (see Section VI-F). Compute the modified group delay cepstra as

$c(n) = \sum_{k=0}^{N_f-1} \tau_m(k)\,\cos\!\left(\frac{\pi n (2k+1)}{2N_f}\right)$  (46)

where $N_f$ is the DFT order and $\tau_m(k)$ is the modified group delay spectrum. The modified group delay cepstra are referred to as the modified group delay feature (MODGDF). The velocity, acceleration, and energy parameters are added to the MODGDF in a conventional manner.

2) Extraction of MFCC: The speech signal is first preemphasized and transformed to the frequency domain using a fast Fourier transform (FFT). The frame size used is 20 ms and the frame shift is 10 ms. A Hamming window is applied on each frame of speech prior to the computation of the FFT. The frequency scale is then warped using the bilinear transformation proposed by Acero [30]

$\hat{\omega} = \omega + 2\tan^{-1}\!\left(\frac{\alpha_w \sin\omega}{1-\alpha_w \cos\omega}\right)$  (47)

where the constant $\alpha_w$, which varies from 0 to 1, controls the amount of warping. The warped frequency scale is then multiplied by a bank of filters whose center frequencies are uniformly distributed in the interval $[f_{min}, f_{max}]$ along the warped frequency axis, where $f_{min}$ is the minimum frequency and $f_{max}$ is the maximum frequency, which primarily decide the useful frequency range of the particular data being handled. The filter shape used at the front end is trapezoidal and its width varies from one center frequency to another. The shape of the filter is controlled by a constant which varies from 0 to 1, where 0 corresponds to triangular and 1 corresponds to rectangular. The filter bank energies are then computed by integrating the energy in each filter. A discrete cosine transform (DCT) is then used

to convert the filter bank log energies to cepstral coefficients. Cepstral mean subtraction is always applied when working with telephone speech. A perceptually motivated filter design is also used as in [14]. The front-end parameters are tuned carefully as in [14] for computing the MFCC so that the best performance is achieved. The LFCC are computed in a similar fashion except that the frequency warping is not done as in the computation of the MFCC. The velocity, acceleration, and energy parameters are added for both the MFCC and LFCC in a conventional manner.

3) Extraction of Spectral Root and Energy Root Compressed MFCC: The spectral root compressed MFCC are computed as described in [27] and the energy root compressed MFCC as in [28]. The computation of the spectral root compressed MFCC is the same as that of the MFCC except that, instead of taking the log of the FFT spectrum, we raise the FFT spectrum to a root value which ranges from 0 to 2. In the computation of the energy root compressed MFCC, instead of raising the FFT spectrum to the root value, the mel frequency filter bank energies are compressed using the root value. In the energy root compressed case, the value of the root used for compression can range from 0 to 1. It is emphasized here that the front-end parameters involved in the computation of both these features, including the root value, have been tuned carefully so that they give the best performance and are not handicapped in any way when they are compared with the MODGDF. The values of the spectral root and the energy root used in the experiments are 2/3 and 0.08, respectively. The velocity, acceleration, and energy parameters are augmented to both forms of the root compressed MFCC in a conventional manner.

C. Estimation of Optimal Values for $lifter_w$, $\alpha$, and $\gamma$

In this section, the optimal values of the three free parameters $lifter_w$, $\alpha$, and $\gamma$ used in the computation of the MODGDF are estimated from a signal processing perspective. We first fix the range of values that these three parameters can take. The length of the cepstral window $lifter_w$ can vary from 4 to 9 for capturing the envelope of the speech spectrum. The parameter $\alpha$ can vary between 0 and 1. The parameter $\gamma$ can vary between 0 and 1. We substantiate the above conjectures from a signal processing viewpoint in the following Sections VI-D and VI-E.

D. Estimation of Optimal Values for $lifter_w$

As discussed in Section III-A, the problem of restoring the resonance structure of the signal with the modified group delay function is reduced to the estimation of $S(\omega)$ by a near-flat spectrum around the zeros [see (26)]. In practice, $S(\omega)$ has to be estimated from the signal. The values of $S(\omega)$ around the zeros have to be preserved so that they cancel the small values in the denominator of the first term in (18). The selection of the length of the window $lifter_w$ used for cepstral smoothing is crucial in obtaining the best recognition performance. The length of $lifter_w$ is selected to vary from 4 to 9. A series of initial experiments conducted for phoneme recognition in [9] showed that any value of $lifter_w$ greater than 9 hurts performance. From a signal processing perspective, the estimate $E(\omega)$ as in (26) should result in a flat spectrum. In this section, we show that even for smaller


Fig. 6. Comparison of the estimated flat spectrum for a speech signal using different cepstral window lengths. (a) and (b) A short segment of speech sampled at 16 kHz. (c) The squared magnitude spectrum $S(\omega)$ and its cepstrally smoothed version for $lifter_w = 6$. (d) The squared magnitude spectrum $S(\omega)$ and its cepstrally smoothed version for $lifter_w = 16$. (e) The estimated flat spectrum $E(\omega)$ for $lifter_w = 6$. (f) The estimated flat spectrum $E(\omega)$ for $lifter_w = 16$.

lengths (4 to 9) of $lifter_w$, the estimated spectrum is indeed flat, by considering a short segment of speech sampled at 16 kHz. The short segment of speech considered for analysis is shown in Fig. 6(a) and (b). The squared magnitude spectrum $S(\omega)$ and its cepstrally smoothed version are shown in Fig. 6(c) and (d) for $lifter_w = 6$ and $lifter_w = 16$, respectively. Fig. 6(e) shows the estimated flat spectrum $E(\omega)$ for a value of $lifter_w = 6$. The estimated flat spectrum for a value of $lifter_w = 16$ is shown in Fig. 6(f). From Fig. 6(e) and (f), it is clear that the estimated spectrum $E(\omega)$ is indeed flat for shorter window lengths (4 to 9).
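The cepstral smoothing and modification steps of (44)–(45) can be sketched for a single frame as follows; this is a simplified numpy illustration (the liftering details, default parameter values, and the small flooring constant are our assumptions, not the authors' implementation):

```python
import numpy as np

def modified_group_delay(x, nfft=512, lifter_w=6, alpha=0.4, gamma=0.9):
    """Modified group delay function of one windowed frame `x`.

    Follows (44)-(45): tau(w) = (X_R*Y_R + X_I*Y_I) / S(w)^(2*gamma),
    where S(w) is a cepstrally smoothed version of |X(w)|^2 and
    tau_m(w) = sign(tau) * |tau|^alpha.
    """
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(n * x, nfft)            # DFT of the time-scaled signal n*x(n)
    power = np.abs(X) ** 2 + 1e-12          # floor to keep the log finite
    # Cepstrally smooth |X|^2 with a low-order symmetric window of length lifter_w.
    cep = np.fft.irfft(np.log(power))
    cep[lifter_w:len(cep) - lifter_w + 1] = 0.0
    S = np.exp(np.fft.rfft(cep).real)
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2.0 * gamma))
    return np.sign(tau) * np.abs(tau) ** alpha
```

Shorter lifter windows keep only the spectral envelope in $S(\omega)$, which is exactly the flat-spectrum behavior the figure above demonstrates.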

Fig. 7. Estimation of optimal $lifter_w$, $\alpha$, and $\gamma$ from a signal processing perspective. (a) $z$-plane plot of a system characterized by four formants (four complex conjugate pole pairs). (b) Impulse response of the system shown in (a). (c) Response of the system in (a) excited with five impulses spaced 60 samples apart. (d) Group delay spectrum of the response in (b). (e) Group delay spectrum of the response in (c). (f) Modified group delay spectrum of the response in (c) for $lifter_w = 6$, $\alpha = 1$, and $\gamma = 1$. (g) Mean square error plot for $\alpha$ and $\gamma$ (varied in steps of 0.1). (h) Modified group delay spectrum of the response in (c) for $lifter_w = 6$, $\alpha = 0.4$, and $\gamma = 0.9$.

In this analysis, the value of $lifter_w$ is fixed at 6, although any variation from 4 to 9 has little effect on the envelope of the modified group delay spectra, as discussed in Section VI-D. The effects of pitch periodicity make the modified group delay function spiky at formant locations [31]. Hence, in order to fix the values of $\alpha$ and $\gamma$, we consider a system characterized by four formants (four complex conjugate pole pairs) as in Fig. 7(a). The system in Fig. 7(a) is excited with an impulse, and the corresponding impulse response2 is shown in Fig. 7(b). From a signal processing perspective, this is equivalent to a signal with a single pitch period. The group delay spectrum of the response in Fig. 7(b) is shown in Fig. 7(d). The system in Fig. 7(a) is excited with a train of five impulses spaced apart by 60 samples, and the corresponding response is shown in Fig. 7(c). From a signal processing perspective, this is equivalent to a signal with five pitch periods. The group delay spectrum of the response in Fig. 7(c) is shown in Fig. 7(e) and has no structure or formant information. Fig. 7(f) shows the envelope of the modified group delay spectrum for $lifter_w = 6$, $\alpha = 1$, and $\gamma = 1$. It is clear that, for these values of $lifter_w$, $\alpha$, and $\gamma$, the envelope of the spectrum of speech is incorrectly captured by the modified group delay spectrum. Hence, the values of $\alpha$ and $\gamma$ need
2Sampling rate = 10 kHz.

to be fixed such that the formant locations are indeed captured by the modified group delay function, as in the case of the minimum phase group delay function shown in Fig. 7(d). A minimization of mean square error approach is used to find the optimal values for $\alpha$ and $\gamma$. Let the minimum phase group delay function be denoted by $\tau_r(\omega)$ and the modified group delay function by $\tau_m(\omega)$. The minimum phase group delay function shown in Fig. 7(d) serves as a reference template, and the modified group delay function $\tau_m(\omega)$ is computed for various values of $lifter_w$, $\alpha$, and $\gamma$. The parameters $\alpha$ and $\gamma$ are varied in steps of 0.1 over the range 0 to 1. The mean-square error (MSE) between $\tau_r(\omega)$ and $\tau_m(\omega)$ is given by

$E = \frac{1}{N}\sum_{\omega}\left(\bar{\tau}_r(\omega) - \bar{\tau}_m(\omega)\right)^2$  (48)

where

$\bar{\tau}(\omega) = \frac{\tau(\omega)}{\max_{\omega}|\tau(\omega)|}$  (49)

and $N$ is the length of $\tau_r(\omega)$ or $\tau_m(\omega)$. The corresponding mean square error plot for $\alpha$ and $\gamma$ (varied in steps of 0.1) over the range 0 to 1 is shown in Fig. 7(g). The error plot converges to a global minimum at $\alpha = 0.4$ and $\gamma = 0.9$. The error curve does not change for lengths of $lifter_w$ from 4 to 9. The envelope of the modified group delay spectrum for $\alpha = 0.4$ and $\gamma = 0.9$ is shown in Fig. 7(h), and it is able to capture the formant information correctly.

F. Estimation of Optimal Values for $lifter_w$, $\alpha$, and $\gamma$ Using Line Search


A series of experiments conducted initially showed that fixing the values of the three parameters $lifter_w$, $\alpha$, and $\gamma$ arbitrarily


TABLE I SERIES OF EXPERIMENTS CONDUCTED ON VARIOUS DATABASES WITH THE MODGDF

TABLE III RECOGNITION PERFORMANCE OF VARIOUS FEATURES FOR SPEAKER IDENTIFICATION. MODGDF (MGD), MFCC (MFC), LFCC (LFC), SPECTRAL ROOT COMPRESSED MFCC (SRMFC), ENERGY ROOT COMPRESSED MFCC (ERMFC), AND SPECTRAL ROOT COMPRESSED LFCC (SRLFC)

TABLE II BEST FRONT-END FOR THE MODGDF ACROSS ALL TASKS AND ACROSS ALL DATABASES USED IN THIS STUDY

will have an impact on the recognition error rate in all three speech processing tasks mentioned earlier. Based on the results of these initial experiments, we fix the length of $lifter_w$ to 8, although the performance remains nearly the same for lengths from 4 to 9. Any value greater than 9 hurts performance badly. Having fixed the length of $lifter_w$, the task now is to fix the values of $\alpha$ and $\gamma$. In order to estimate the values of $\alpha$ and $\gamma$, an extensive optimization was carried out in [9] on the SPINE database [32] for phoneme recognition. To ensure that the optimized parameters were not specific to a particular database, we collected the sets of parameters that gave the best performance on the SPINE database as in [9] and tested them on other databases, i.e., the DBIL database (for syllable recognition), TIMIT and NTIMIT (for speaker identification), and the OGI_MLTS database (for language identification). The values of the parameters that gave the best performance across all databases and across all tasks were finally chosen for the experiments. The optimization technique uses successive line searches. For each iteration, $\gamma$ is held constant, $\alpha$ is varied from 0 to 1 in increments of 0.1 (line search), and the recognition rate is noted for the three tasks on the aforementioned databases. The value of $\alpha$ that maximizes the recognition rate is fixed as the optimal value. A similar line search is performed on $\gamma$ (varying it from 0 to 1 in increments of 0.1) keeping $\alpha$ fixed. The set of values of $\alpha$ and $\gamma$ that gives the lowest error rate across the three tasks is retained. The series of experiments conducted to estimate the optimal values for $lifter_w$, $\alpha$, and $\gamma$ using line search is summarized in Table I. Based on the experiments conducted as in Table I, the best front end across all tasks and across all databases used in this study is given in Table II. It is emphasized here that the values of $lifter_w$, $\alpha$, and $\gamma$ listed in Table II are used for the evaluation of the MODGDF for all three tasks, namely speaker, language, and syllable recognition.
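The alternating line search described above can be sketched as follows, with a generic `evaluate(alpha, gamma)` callback standing in for a full recognition experiment (the callback, starting values, and number of passes are our assumptions):

```python
import numpy as np

def line_search(evaluate, alpha0=0.5, gamma0=0.5, step=0.1):
    """Alternating line search over alpha and gamma in [0, 1].

    `evaluate(alpha, gamma)` is assumed to return a recognition rate;
    each pass holds one parameter fixed and sweeps the other in
    increments of `step`, keeping the value that maximizes the rate.
    """
    grid = np.round(np.arange(0.0, 1.0 + step / 2, step), 10)
    alpha, gamma = alpha0, gamma0
    for _ in range(2):                      # a couple of alternating passes
        alpha = max(grid, key=lambda a: evaluate(a, gamma))
        gamma = max(grid, key=lambda g: evaluate(alpha, g))
    return alpha, gamma
```

With a smooth objective peaked at (0.4, 0.9), this sweep recovers the same optimum the text reports for the MSE criterion.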
G. Baseline System and Experimental Results for Automatic Speaker Identification

The baseline system used in this study uses the principle of likelihood maximization. A series of Gaussian mixture models (GMMs) is used to model the voices of speakers for whom training data is available [14]. Single-state, 64-mixture GMMs are trained for each of the 600 speakers in the database. A classifier evaluates the likelihoods of the unknown speaker's voice data against these models. The model that gives the maximum

accumulated likelihood is declared as the correct match. Out of the ten sentences for each speaker, six were used for training and four were used for testing. The tests were conducted on 600 speakers (600 × 4 tests), and the number of tests was 2400. A summary of the results of performance evaluation for various features on both the TIMIT [16] (clean speech data) and NTIMIT [17] (noisy telephone data) corpora using the GMM scheme is listed in Table III.

1) Discussion: For the TIMIT data, the MODGDF gave a recognition performance of 99%. The performance of the MODGDF for this task is better than that of the spectral root compressed MFCC (root = 2/3) at 97.25%, the log compressed MFCC at 98%, and the energy root compressed MFCC (root = 0.08) at 98%, as indicated in Table III. For the NTIMIT data, the MODGDF gave a recognition performance of 36%. The performance of the MODGDF for this task is better than that of the spectral root compressed MFCC (root = 2/3) at 34.25%, the log compressed MFCC at 34%, and the energy root compressed MFCC (root = 0.08) at 34.75%, as indicated in Table III. The performance of the two forms of LFCC is also listed in Table III. It is emphasized again that the value of the root in both forms of the root compressed features has been taken after careful optimization using line search.

H. Baseline System and Experimental Results for Language Identification

The baseline system used for this task is very similar to the system used for the automatic speaker identification task, except that each language is now modeled by a GMM. Single-state, 64-mixture GMMs are trained for each of the 11 languages in the database. Out of 90 phrases for each language, 45 were used for training and 20 were used for testing. The length of each test utterance was 45 s. The average recognition performance across three languages for the three-language task and across 11 languages for the 11-language task is computed.
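The likelihood-maximization rule shared by the speaker and language baselines can be sketched as follows; this assumes pretrained diagonal-covariance mixtures, and the data layout and function names are illustrative, not from the paper:

```python
import numpy as np

def log_likelihood(frames, means, variances, weights):
    """Total log-likelihood of feature frames under a diagonal GMM.

    `frames` is (T, D); `means` and `variances` are (M, D); `weights` is (M,).
    """
    frames = np.asarray(frames, float)[:, None, :]            # (T, 1, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    exponent = -0.5 * np.sum((frames - means) ** 2 / variances, axis=2)
    per_mix = np.log(weights) + log_norm + exponent           # (T, M)
    per_frame = np.logaddexp.reduce(per_mix, axis=1)          # stable log-sum-exp
    return float(np.sum(per_frame))

def identify(frames, class_models):
    """Return the class (speaker or language) with the highest accumulated likelihood."""
    return max(class_models,
               key=lambda c: log_likelihood(frames, *class_models[c]))
```

The same `identify` call serves both tasks; only the models behind each key change from speakers to languages.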
A summary of the results of performance evaluation for various features on both the DBIL and OGI_MLTS corpora using the GMM scheme is listed in Table IV.

1) Discussion: For the three-language task on the DBIL database, the MODGDF gave a recognition performance of 96%. The performance of the MODGDF for this task is better than that of the spectral root compressed MFCC (root = 2/3) at


TABLE IV RECOGNITION PERFORMANCE OF VARIOUS FEATURES FOR LANGUAGE IDENTIFICATION. MODGDF (MGD), MFCC (MFC), LFCC (LFC), SPECTRAL ROOT COMPRESSED MFCC (SRMFC), ENERGY ROOT COMPRESSED MFCC (ERMFC), AND SPECTRAL ROOT COMPRESSED LFCC (SRLFC)

TABLE V SYLLABLE RECOGNITION ACCURACY (SRA) RESULTS OF VARIOUS FEATURES. MODGDF (MGD), MFCC (MFC), LFCC (LFC), SPECTRAL ROOT COMPRESSED MFCC (SRMFC), ENERGY ROOT COMPRESSED MFCC (ERMFC), AND SPECTRAL ROOT COMPRESSED LFCC (SRLFC)

95%, the log compressed MFCC at 95%, and the energy root compressed MFCC (root = 0.08) at 95.4%, as indicated in Table IV. For the 11-language task on the OGI_MLTS data, the MODGDF gave a recognition performance of 53%. The performance of the MODGDF for this task is better than that of the spectral root compressed MFCC (root = 2/3) at 50.4%, the log compressed MFCC at 50%, and the energy root compressed MFCC (root = 0.08) at 50.6%, as indicated in Table IV. The performance of the two forms of LFCC is also listed in Table IV. It is emphasized that the value of the root in both forms of the root compressed features has been selected after extensive optimization using line search.

I. Baseline System and Experimental Results for Syllable Recognition

The baseline recognition system uses hidden Markov models (HMMs) trained a priori for 320 syllables for Tamil and 265 syllables for Telugu. The number of syllables used for training is selected based on their frequency of occurrence in the respective corpora. During the training phase, HMMs are built for every syllable that occurs more than 50 times in the corpus. A separate model is built for silence. Five-state HMMs with three mixtures/state are used throughout the experimental study. During the testing phase, the test sentence is segmented at the boundaries of syllabic units using the minimum phase group delay function derived from the causal portion of the root compressed energy function, assuming that it is an arbitrary magnitude spectrum, exactly as in [24]. These segments are then checked in isolated style against all HMMs built a priori. The HMM that gives the maximum likelihood value is declared as the correct match. The recognized isolated syllables are then concatenated in the same order as they appear in the test sentence to output the recognized sentence. The syllable recognition accuracy (SRA) results using the baseline system for two news bulletins, each of duration 15 min, comprising 9400 syllables for Tamil and Telugu, are illustrated in Table V.
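Assuming the SRA used in Table V counts correctly recognized syllables $C$ against insertions $I$ out of $T$ total syllables in the recognized sentence, the metric reduces to a one-liner; the symbol names are ours:

```python
def syllable_recognition_accuracy(correct, insertions, total):
    """Raw syllable recognition accuracy in percent (no language model).

    Computed as 100 * (C - I) / T, where C is the number of correctly
    recognized syllables, I the number of insertions, and T the total
    number of syllables in the recognized sentence.
    """
    return 100.0 * (correct - insertions) / total
```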
The baseline SRA is given by $SRA = \frac{C - I}{T} \times 100$, where $C$ is the number of correctly recognized syllables, $I$ is the number of insertions, and $T$ is the total number of syllables in the sentence that is recognized. It should be noted that the SRA is

raw syllable recognition accuracy without using any language models.

1) Discussion: For the Telugu data, the MODGDF gave a baseline SRA of 38.2%. The SRA of the MODGDF for this task is better than that of the spectral root compressed MFCC (root = 2/3) at 35.6% and the energy root compressed MFCC (root = 0.08) at 38%, and slightly less than that of the log compressed MFCC at 38.6%, as indicated in Table V. For the Tamil data, the MODGDF gave an SRA of 36.7%. The SRA of the MODGDF for this task is better than that of the spectral root compressed MFCC (root = 2/3) at 34.1% and the energy root compressed MFCC (root = 0.08) at 36.5%, and slightly less than that of the log compressed MFCC at 37.1%, as indicated in Table V. The SRA of the two forms of LFCC is also listed in Table V. It is emphasized here that the value of the root in both forms of the root compressed features has been selected after careful optimization using line search.

VII. CONCLUSION AND SCOPE FOR FUTURE WORK

The group delay function and its significance in speech processing have been discussed in earlier efforts. The idea of extracting features for speech recognition from the modified group delay function, which is based on the STFT phase spectrum, has been investigated in this paper. The definition of the group delay function is modified by the introduction of the parameters $lifter_w$, $\alpha$, and $\gamma$. Cepstral features are derived from the modified group delay function, which are decorrelated and relatively robust to channel mismatch and noise. These features (MODGDF) have been used for three speech recognition tasks, namely speaker identification, language identification, and syllable recognition, for the first time. It is evident from the results presented in this work that the MODGDF captures the dynamic information of the speech signal. The results suggest that the MODGDF can be used in practice as a feature across all speech recognition tasks. Although the significance of the MODGDF for various speech recognition tasks is discussed in this paper, there are certain issues that need attention.
It is illustrated that the MODGDF outperforms the MFCC in all the feature evaluation criteria. However, in terms of recognition performance, the gains of the MODGDF over the MFCC are not spectacular. This is one issue that needs to be analyzed. The other issue is that of warping in the modified group delay domain. The issue of warping should


be analyzed both theoretically and experimentally. The mathematical relation between the modified group delay spectrum and the short-term power spectrum of speech is another issue that needs to be understood.

REFERENCES
[1] J. W. Picone, "Signal modeling techniques in speech recognition," Proc. IEEE, vol. 81, no. 9, pp. 1215–1247, Sep. 1993. [2] S. S. Stevens and J. Volkman, "The relation of pitch to frequency," Amer. J. Psychol., vol. 53, no. 3, pp. 329–353, Jul. 1940. [3] P. Mermelstein and S. B. Davis, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980. [4] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993. [5] R. Schluter and H. Ney, "Using phase spectrum information for improved speech recognition performance," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2001, vol. 1, pp. 7–11. [6] H. A. Murthy, "Algorithms for processing Fourier transform phase of signals," Ph.D. dissertation, Dept. Comput. Sci. Eng., Indian Inst. Technol., Madras, India, 1992. [7] B. Yegnanarayana and H. A. Murthy, "Significance of group delay functions in spectrum estimation," IEEE Trans. Signal Process., vol. 40, no. 9, pp. 2281–2289, Sep. 1992. [8] B. Yegnanarayana, D. K. Saikia, and T. R. Krishnan, "Significance of group delay functions in signal reconstruction from spectral magnitude or phase," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-32, no. 3, pp. 610–622, Jun. 1984. [9] H. A. Murthy and V. R. R. Gadde, "The modified group delay function and its application to phoneme recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Hong Kong, China, Apr. 2003, vol. I, pp. 68–71. [10] R. M. Hegde, H. A. Murthy, and V. R. R. Gadde, "Application of the modified group delay function to speaker identification and discrimination," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Montreal, QC, Canada, May 2004, vol. 1, pp. 517–520. [11] ——, "Speech processing using joint features derived from the modified group delay function," in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process., Philadelphia, PA, Mar. 2005, vol. 1, pp. 541–544. [12] P. Yip and K. R. Rao, Discrete Cosine Transform: Algorithms, Advantages and Applications. Norwell, MA: Academic, 1997. [13] H. Gish and M. Schmidt, "Text independent speaker identification," IEEE Signal Process. Mag., vol. 11, no. 4, pp. 18–32, Oct. 1994. [14] H. A. Murthy, F. Beaufays, and L. P. Heck, "Robust text-independent speaker identification over telephone channels," IEEE Trans. Signal Process., vol. 7, no. 5, pp. 554–568, Sep. 1999. [15] D. A. Reynolds, "Large population speaker identification using clean and telephone speech," IEEE Signal Process. Lett., vol. 2, no. 3, pp. 46–48, Mar. 1995. [16] NTIS, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993. [17] C. Jankowski, A. Kalyanswamy, S. Basson, and J. Spitz, "NTIMIT: A phonetically balanced, continuous speech, telephone bandwidth speech database," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 1990, pp. 109–112. [18] L. Besacier and J. F. Bonastre, "Time and frequency pruning for speaker identification," in Proc. Int. Conf. Pattern Recognition, 1998, pp. 1619–1621. [19] Database for Indian Languages, Speech and Vision Lab, IIT Madras, Chennai, India, 2001. [20] Y. K. Muthusamy, R. A. Cole, and B. T. Oshika, "The OGI multi-language telephone speech corpus," in Proc. Int. Conf. Spoken Lang. Process., Oct. 1992, pp. 895–898. [21] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 2000. [22] H. A. Murthy and B. Yegnanarayana, "Formant extraction from group delay function," Speech Commun., vol. 10, pp. 209–221, 1991. [23] B. Yegnanarayana, "Formant extraction from linear prediction phase spectrum," J. Acoust. Soc. Amer., pp. 1638–1640, 1978. [24] V. K. Prasad, T. Nagarajan, and H. A. Murthy, "Automatic segmentation of continuous speech using minimum phase group delay functions," Speech Commun., vol. 42, pp. 429–446, 2004.

[25] K. V. M. Murthy and B. Yegnanarayana, "Effectiveness of representation of signals through group delay functions," Signal Process., vol. 17, pp. 141–150, 1989. [26] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978. [27] P. Alexandre and P. Lockwood, "Root cepstral analysis: A unified view. Application to speech processing in car noise environments," Speech Commun., vol. 12, no. 3, pp. 277–288, Jul. 1993. [28] R. Sarikaya and J. H. L. Hansen, "Analysis of the root-cepstrum for acoustic modeling and fast decoding in speech recognition," in Proc. Eurospeech, Sep. 2001, pp. 687–690. [29] K. Fukunaga, Introduction to Statistical Pattern Recognition. Boston, MA: Academic, 1990. [30] A. Acero, "Acoustical and environmental robustness in automatic speech recognition," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, 1990. [31] H. A. Murthy and B. Yegnanarayana, "Speech processing using group delay functions," Signal Process., vol. 17, pp. 141–150, 1989. [32] V. R. R. Gadde, A. Stolcke, J. Z. D. Vergyri, K. Sonmez, and A. Venkatraman, The SRI SPINE 2001 Evaluation System. Menlo Park, CA: SRI, 2001.

Rajesh M. Hegde received the B.E. degree in instrumentation and electronics engineering from the University of Mysore, Mysore, India, in 1991, the M.E. degree in electronics engineering from Bangalore University, Bangalore, India, in 2000, and the Ph.D. degree in computer science and engineering from the Indian Institute of Technology Madras (IIT-M), Chennai, India, in 2005. He is currently working as a Postdoctoral Researcher at the California Institute for Telecommunications and Information Technology (CALIT2), University of California at San Diego, La Jolla. At CALIT2, he is involved in a project that is aimed at generating situational awareness from multimodal inputs and disseminating information during crisis and disaster scenarios.
His research interests include feature extraction for speech recognition, speaker identification, audiovisual speech recognition, and event detection from multimodal inputs.

Hema A. Murthy (M'94) received the B.E. degree in electronics and communications engineering from Osmania University, Hyderabad, India, in 1980, the M.Eng. degree in electrical and computer engineering from McMaster University, Hamilton, ON, Canada, in 1986, and the Ph.D. degree in computer science and engineering from the Indian Institute of Technology Madras (IIT-M), Chennai, India, in 1992. From 1980 to 1983, she was a Scientific Officer with the Speech and Digital Systems Group, Tata Institute of Fundamental Research, Bombay, India. In 1988, she joined the faculty of the Department of Computer Science and Engineering, IIT-M. From 1995 to 1996, she was a Postdoctoral Fellow at the Speech Technology and Research Laboratory, SRI International. She is currently with the Department of Computer Science and Engineering, IIT-M. Her research interests include speech signal processing, speech and speaker recognition, handwriting recognition, and computer networks. She has also been actively involved in education as a means of empowerment for the marginalized sections of her society.

Venkata Ramana Rao Gadde received the B.Tech. degree in electronics and electrical communications from the Indian Institute of Technology Kharagpur, India, in 1982 and the M.Tech. and Ph.D. degrees in computer science from the Indian Institute of Technology Madras (IIT-M), Chennai, India, in 1986 and 1994, respectively. He is a Senior Research Engineer at the Speech Technology and Research (STAR) Laboratory, SRI International, Menlo Park, CA. From 1988 to 1997 he was a member of the faculty at IIT-M. His research interests include speech technology, image processing, statistical modeling, and robust systems.
