
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 4, MAY 2011

A Corpus-Based Approach to Speech Enhancement From Nonstationary Noise


Ji Ming, Member, IEEE, Ramji Srinivasan, Member, IEEE, and Danny Crookes, Member, IEEE
Abstract: Temporal dynamics and speaker characteristics are two important features of speech that distinguish speech from noise. In this paper, we propose a method to maximally extract these two features of speech for speech enhancement. We demonstrate that this can reduce the requirement for prior information about the noise, which can be difficult to estimate for fast-varying noise. Given noisy speech, the new approach estimates clean speech by recognizing long segments of the clean speech as whole units. In the recognition, clean speech sentences, taken from a speech corpus, are used as examples. Matching segments are identified between the noisy sentence and the corpus sentences. The estimate is formed by using the longest matching segments found in the corpus sentences. Longer speech segments as whole units contain more distinct dynamics and richer speaker characteristics, and can be identified more accurately from noise than shorter speech segments. Therefore, estimation based on the longest recognized segments increases the noise immunity and hence the estimation accuracy. The new approach consists of a statistical model to represent up to sentence-long temporal dynamics in the corpus speech, and an algorithm to identify the longest matching segments between the noisy sentence and the corpus sentences. The algorithm is made more robust to noise uncertainty by introducing missing-feature based noise compensation into the corpus sentences. Experiments have been conducted on the TIMIT database for speech enhancement from various types of nonstationary noise including song, music, and crosstalk speech. The new approach has shown improved performance over conventional enhancement algorithms in both objective and subjective evaluations.

Index Terms: Corpus-based speech modeling, longest matching segment, nonstationary noise, speech enhancement, speech separation.

I. INTRODUCTION

A speech signal has two distinct features: its temporal dynamics, subject to acoustic, lexical, and language constraints, and its speaker characteristics. These two features distinguish a speech sentence from non-speech noise, and from other speakers' sentences. In this paper, we propose a method to maximally extract these two features of speech for retrieving speech from noise, including crosstalk interference.

Manuscript received January 28, 2010; revised May 18, 2010; accepted July 21, 2010. Date of publication August 09, 2010; date of current version February 14, 2011. This work was supported by the U.K. EPSRC under Grant EP/G001960/1. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Susanto Rahardja. The authors are with the School of Electronics, Electrical Engineering, and Computer Science, Queen's University Belfast, Belfast BT7 1NN, U.K. (e-mail: j.ming@qub.ac.uk; r.srinivasan@qub.ac.uk; d.crookes@qub.ac.uk). Digital Object Identifier 10.1109/TASL.2010.2064312

We aim to reduce the requirement for prior information about the noise, as this can be difficult to estimate with fast-varying noise and crosstalk. We assume the availability of only single-channel data. Current single-channel techniques for speech enhancement typically include optimal filtering and optimal estimation. In optimal filtering, no specific knowledge about the speech is assumed, except its independence of the noise; examples include spectral subtraction [1] and Wiener filtering [2], [3]. In optimal estimation, a priori knowledge of the probability distribution of the speech is assumed and this is used to derive the estimators, for example, minimum mean-square error (MMSE) [4], maximum a posteriori (MAP) [5], [6], or perceptually weighted Bayesian spectral estimators [7]. In optimal estimation, parametric statistical models such as Gaussian, Gamma, Laplacian, or super-Gaussian have found use in representing the probability distribution of the speech discrete Fourier transform (DFT) coefficients or spectral amplitudes (see, for example, [4], [6], [8]-[10]). Data-driven models, such as vector quantization (VQ) codebooks or Gaussian mixture models (GMMs), have also been used to provide the speech priors required in optimal estimators (e.g., [11], [12]). Additionally, subspace approaches have been used in speech enhancement, which project a noisy signal onto two subspaces, signal and noise; noise reduction is achieved by retaining only the signal-subspace projection, usually modified by filtering or a speech prior (e.g., [13]-[15], [41]). All these techniques require prior knowledge about the noise, typically the noise variance or power spectral density, or the instantaneous signal-to-noise ratio (SNR), at all times. When the required noise-related statistics are not available, they are predicted by using neighboring observations without significant speech content, based on voice activity detection, minimum statistics, time-recursive averaging, and more recently MMSE-based high-resolution noise DFT estimation, and their combinations (see, for example, [16]-[21]). Most noise estimation algorithms work well for stationary or slowly varying noise, but less so for heavily nonstationary noise. This is because of the weak predictability of fast-varying noises. Because of the nonstationary nature of the speech signal, most current enhancement algorithms operate on a frame-by-frame basis. Many algorithms ignore the temporal constraints between adjacent speech frames. Without context, and without specific knowledge about the noise, it can be difficult to separate the speech from noise within the duration of a frame (typically about 20 ms). This is especially true when the noise is a form of speech (e.g., a crosstalk sentence). Previous research has revealed the importance of imposing cross-time spectral constraints in improving speech enhancement quality (e.g., [10], [22]). As part of the effort towards data-driven priors, hidden Markov models (HMMs) trained on realistic speech data have been used to provide the speech priors to form the



estimators (see, for example, [23]-[27]). HMMs can represent a class of nonstationary random processes. However, their state dynamics, under the first-order Markov chain assumption, are unrealistic for representing the temporal dynamics of speech, which decide how short-time sounds can be concatenated one to another to form a realistic speech sentence. Other alternatives include the use of state space models, mostly of order one, to model the evolution of the speech parameters over time. This results in the Kalman filtering algorithms (e.g., [28], [29]). With few exceptions (e.g., [27]), most current algorithms for speech enhancement do not explicitly model the speaker characteristics of the target speech. In this paper, we study the problem of retrieving speech from nonstationary noise assuming minimal noise prior. This assumption applies to heavily nonstationary noise, such as song, music, or crosstalk speech, which can be difficult to predict with conventional noise estimation algorithms. We describe an approach aiming to maximally extract the two important features of speech, temporal dynamics and speaker-class characteristics, for its separation from noise. We achieve this through recognition of long speech segments as whole units from noise. Specifically, we use clean speech sentences, taken from a corpus, to provide examples of up to sentence-long temporal dynamics and speaker-class characteristics for the target speech. Given a noisy sentence, we identify the matching segments between the noisy sentence and the corpus sentences, to seek an estimate of the target sentence by using clean corpus segments. We form the estimate by using the longest matching segments found, assuming that longer speech segments as whole units can be identified more accurately from noise than shorter speech segments because of their more distinct temporal dynamics and richer speaker characteristics. This theory is supported by speech recognition experiments, which reveal that higher recognition accuracy can be achieved for larger primary units (e.g., syllables or words) than for smaller primary units (e.g., phones), and for speaker or speaker-class specific modeling than for speaker-independent modeling, in the presence of noise. Therefore, maximizing the target speech segments to be identified as whole units can effectively reduce the identification error without assuming specific information about the noise. This work is an extension of our previous work for the Pascal Speech Separation Challenge [34], which dealt with small-vocabulary, known-speaker, and grammatically restricted speech. The extension includes two parts. First, we lift those limitations of the Challenge and deal with free-text and free-speaker speech. Second, we combine our previous method of missing-feature based noise compensation [35], [36] into the identification of the matching segments, to further improve the robustness to noise. For convenience, we call the new approach the longest matching segment (LMS) approach. Sections II-VII present details of the new LMS approach, in three parts. The first part, in Section II, describes a method used in the LMS approach to model the corpus sentences for up to sentence-long temporal dynamics, to facilitate the maximum use of the corpus sentences for segment identification. The second part, in Sections III and IV, describes an algorithm for identifying the longest matching segments between the noisy

sentence and the corpus sentences, as a means of increasing the speech estimation accuracy assuming minimal information about the noise. The missing-feature based noise compensation method is incorporated into the identification, for further improving noise robustness. The last part, in Section V, describes the algorithms for reconstructing the target speech based on the longest matching segments found in the clean corpus sentences. Experimental results, containing examples of speech enhancement from various types of nonstationary noise and crosstalk speech, are presented in Section VI. Finally, conclusions are drawn in Section VII.

II. MODELING LONG-RANGE TEMPORAL DYNAMICS OF SPEECH

We use a speech corpus, consisting of prerecorded clean speech sentences by various speakers, to provide the required free-speaker and free-text acoustic, lexical, and language constraints for the target speech. A reasonably sized speech database, as normally used to develop HMM systems for large-vocabulary speaker-independent speech recognition, could suit the purpose. Each sentence in the corpus will serve, simultaneously, as an acoustic model of a speech process that may be partly or completely realized in the target sentence, and as a text-dependent model of the acoustic characteristics of the target speaker class. To facilitate the maximum use of the constraints, we will model the complete temporal dynamics in each corpus sentence, such that any segment of any length in the sentence, up to the complete sentence, can be used as a whole unit to identify the corresponding units/segments in the target speech. Because of their more distinct dynamics and richer speaker characteristics, longer speech segments as whole units can be recognized with lower error rates than shorter segments. Therefore, estimation based on the longest recognized segments enhances the noise immunity.

We use a new example-based approach, as opposed to conventional templates, to build the corpus sentence models. The new approach has three steps. First, we divide each corpus sentence into short-time frames, and model each frame with a feature vector suitable for identifying the matching frame given a noisy measurement (we use subband-based cepstral coefficients as the feature vector, to be detailed later). Second, we train a GMM for the frame feature vectors using all the corpus sentences. Denote by $\Lambda$ the GMM trained on all the corpus data for frame vectors $x$:

$$p(x \mid \Lambda) = \sum_{k=1}^{K} w_k \, p(x \mid \lambda_k) \qquad (1)$$

where $\lambda_k$ is the $k$th Gaussian component and $w_k$ is the corresponding weight. Finally, based on $\Lambda$, we build a model for each corpus sentence to represent the complete temporal dynamics in the sentence. Let $X = (x_1, x_2, \ldots, x_T)$ be a corpus sentence with $T$ frames and $x_t$ being the frame at time $t$. We obtain a new representation for $X$ by taking each frame $x_t$ from $X$ and finding the Gaussian component in $\Lambda$ that maximizes the likelihood of the frame. This results in a time sequence of maximum-likelihood Gaussian components, which can be expressed using the corresponding time sequence of indices

$$I_X = (i_1, i_2, \ldots, i_T), \qquad i_t = \arg\max_{k} p(x_t \mid \lambda_k) \qquad (2)$$

where $i_t$ is an index addressing a Gaussian component $\lambda_{i_t}$, in $\Lambda$, that produces maximum likelihood for the $t$th frame, $x_t$, in corpus sentence $X$. We will use $I_X$ as a model for corpus sentence $X$. In the model, the individual Gaussian components represent the probability distributions of the short-time speech spectra that form this sentence, and the time sequence of the Gaussian components captures the full temporal dynamics, from acoustic to lexical and to language, that join together the appropriate short-time spectra to form the specific sentence. The model also captures the acoustic characteristics of the speaker class embedded in the sentence, in a text-dependent mode.

Recently, there has been renewed interest in using a do-nothing approach, for example, templates, in place of the heavy-handed HMM or GMM approach for speech and speaker recognition (e.g., [30]-[32]). Templates make fewer assumptions/manipulations on the speech data, and thus can be more accurate than an HMM in representing the long-range temporal dynamics of a speech sentence. However, unlike the statistics-based HMM or GMM, templates lack statistical smoothness (and hence robustness) in representing the short-time speech spectra, which are subject to random variations. The above sentence model (1) and (2) represents a balance between these two approaches. It combines statistical and template-based approaches seamlessly in the same framework, to offer both a smooth representation of the short-time spectra and a sentence-long representation of the temporal dynamics.

III. IDENTIFYING MATCHING SEGMENTS WITH LARGE CONTINUITIES

Let $Y = (y_1, y_2, \ldots, y_N)$ be a noisy test sentence with $N$ frames and $y_t$ being the frame at time $t$. In our system, the problem of speech enhancement can be stated as identifying for each noisy frame $y_t$ a matching corpus frame, such that the underlying target speech frame can be reconstructed using the clean corpus frame modeled by the corresponding Gaussian component. Since a segment of consecutive speech frames, when treated as a whole unit, can be identified more accurately from noise than the individual frames, we seek the longest matching segments between the noisy sentence and the corpus sentences, as a means of maximizing the identification accuracy without assuming specific information about the noise. In the following, we describe an algorithm for identifying the longest matching segments, based on the corpus sentence models described in Section II.

Let $y_{t_1:t_2} = (y_{t_1}, \ldots, y_{t_2})$ represent a test segment taken from test sentence $Y$ and consisting of consecutive frames from time $t_1$ to $t_2$. Let $\lambda_{\tau_1:\tau_2} = (\lambda_{i_{\tau_1}}, \ldots, \lambda_{i_{\tau_2}})$ represent a corpus segment taken from model $I_X$ and modeling consecutive frames from $\tau_1$ to $\tau_2$ in corpus sentence $X$. We measure the similarity between the two segments by using the posterior probability of the corpus segment $\lambda_{\tau_1:\tau_2}$ given the test segment $y_{t_1:t_2}$. Assuming an equal prior probability for all speech segments that may match $y_{t_1:t_2}$, the posterior probability may be expressed as

$$P(\lambda_{\tau_1:\tau_2} \mid y_{t_1:t_2}) = \frac{p(y_{t_1:t_2} \mid \lambda_{\tau_1:\tau_2})}{\displaystyle\sum_{\lambda'} p(y_{t_1:t_2} \mid \lambda') + p(y_{t_1:t_2} \mid \Lambda)} \qquad (3)$$

where $p(y_{t_1:t_2} \mid \lambda_{\tau_1:\tau_2})$ is the likelihood of the test segment associated with the corpus segment. This likelihood can be calculated by using the Viterbi algorithm, assuming that the frames within a segment are conditionally independent of one another. Thus, it can be expressed as

$$p(y_{t_1:t_2} \mid \lambda_{\tau_1:\tau_2}) = \max_{\phi} \prod_{t=t_1}^{t_2} p(y_t \mid \lambda_{i_{\phi(t)}}) \qquad (4)$$

where $\phi$ represents the most-likely time warping path between the test segment $y_{t_1:t_2}$ and the corpus segment $\lambda_{\tau_1:\tau_2}$, assuming that $\phi(t_1) = \tau_1$ and $\phi(t_2) = \tau_2$. We can use the standard DTW (dynamic time warping) continuity conditions to constrain the allowable warping path and the ratio of lengths of the test and corpus segments [33].

The denominator of (3) includes two terms. The first term corresponds to the collection of all the corpus segments $\lambda'$, taken from all the corpus sentences with all possible segment locations and lengths, that are likely to match the given test segment $y_{t_1:t_2}$. The second term, denoted by $p(y_{t_1:t_2} \mid \Lambda)$, corresponds to the likelihood that $y_{t_1:t_2}$, as a whole unit, matches a segment that is not included in the corpus sentences. This likelihood of unseen segments may be suitably approximated by using the corpus GMM $\Lambda$, expressed in (1). The following shows the representation used in our algorithm, which calculates the likelihood of $y_{t_1:t_2}$ associated with $\Lambda$:

$$p(y_{t_1:t_2} \mid \Lambda) = \prod_{t=t_1}^{t_2} \left[ \sum_{k=1}^{K} w_k \, p(y_t \mid \lambda_k) \right] \qquad (5)$$

In (5), the sum inside the brackets is simply the corpus GMM based likelihood for frame $y_t$. In other words, if we view the segmental temporal dynamics as text dependence, then (4) gives a text-dependent likelihood, while (5) gives a text-independent likelihood, of the test segment. Test segments without whole matching corpus segments will result in low text-dependent likelihoods but not necessarily low text-independent likelihoods, and hence low posterior probabilities. For good matching $y_{t_1:t_2}$ and $\lambda_{\tau_1:\tau_2}$, we can assume that the text-dependent likelihood is larger than the text-independent likelihood, i.e., $p(y_{t_1:t_2} \mid \lambda_{\tau_1:\tau_2}) \ge p(y_{t_1:t_2} \mid \Lambda)$. This is because

$$p(y_{t_1:t_2} \mid \Lambda) = \prod_{t=t_1}^{t_2} \sum_{k=1}^{K} w_k \, p(y_t \mid \lambda_k) \approx \prod_{t=t_1}^{t_2} w_{i_{\phi(t)}} \, p(y_t \mid \lambda_{i_{\phi(t)}}) \le \max_{\phi} \prod_{t=t_1}^{t_2} p(y_t \mid \lambda_{i_{\phi(t)}}) = p(y_{t_1:t_2} \mid \lambda_{\tau_1:\tau_2}) \qquad (6)$$
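Before continuing with the justification of (6), the following sketch illustrates how the quantities defined so far might be computed. It is a minimal numpy/scipy illustration, not the authors' implementation: it assumes diagonal-covariance Gaussians, replaces the DTW path of (4) with a fixed one-to-one frame alignment, and uses hypothetical array shapes and function names of our own.

```python
import numpy as np
from scipy.special import logsumexp

def frame_component_loglik(frames, means, variances):
    """Log N(y_t; mu_k, diag(var_k)) for every frame t and component k.
    frames: (T, D); means, variances: (K, D). Returns a (T, K) table."""
    d = frames.shape[1]
    diff = frames[:, None, :] - means[None, :, :]                  # (T, K, D)
    maha = np.sum(diff ** 2 / variances[None, :, :], axis=2)       # (T, K)
    log_det = np.sum(np.log(variances), axis=1)                    # (K,)
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)

def sentence_model(frames, means, variances):
    """Eq. (2): index sequence of maximum-likelihood Gaussian components."""
    return np.argmax(frame_component_loglik(frames, means, variances), axis=1)

def text_independent_loglik(seg_loglik, log_weights):
    """Eq. (5): corpus-GMM (text-independent) log-likelihood of a test segment.
    seg_loglik: (L, K) table for the segment's frames; log_weights: (K,)."""
    return float(np.sum(logsumexp(seg_loglik + log_weights, axis=1)))

def text_dependent_loglik(seg_loglik, corpus_indices):
    """Eq. (4) with the warping path fixed to a one-to-one alignment."""
    return float(seg_loglik[np.arange(len(corpus_indices)), corpus_indices].sum())

def segment_posteriors(seg_loglik, log_weights, candidate_segments):
    """Eq. (3): posterior of each candidate corpus segment given the test segment.
    candidate_segments: list of index arrays, each the same length as the segment."""
    td = np.array([text_dependent_loglik(seg_loglik, c) for c in candidate_segments])
    ti = text_independent_loglik(seg_loglik, log_weights)
    return np.exp(td - logsumexp(np.append(td, ti)))
```

Here candidate_segments would be fixed-length slices of the index sequences produced by sentence_model for every corpus sentence.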


where the approximation is based on the assumption that, for good matching test and corpus segments, the likelihood is dominated by the Gaussian components forming the most-likely warping path, and the inequality holds because the Gaussian weights satisfy $w_{i_{\phi(t)}} \le 1$. Thus, with (3) and (5), we can obtain a larger posterior probability when $y_{t_1:t_2}$ and $\lambda_{\tau_1:\tau_2}$ match, and a smaller posterior probability when $y_{t_1:t_2}$ and $\lambda_{\tau_1:\tau_2}$ mismatch, as desired.

The posterior probability has another important characteristic: it favors the continuity of the matching segments, in terms of giving larger values for longer matching $y_{t_1:t_2}$ and $\lambda_{\tau_1:\tau_2}$. To show this, assume that $y_{t_1:t_2}$ and $\lambda_{\tau_1:\tau_2}$ are a pair of matching segments such that the following likelihood inequalities are observed: $p(y_{t+1:t_2} \mid \lambda_{\tau+1:\tau_2}) \ge p(y_{t+1:t_2} \mid \lambda')$ for any corpus segment $\lambda'$, and $p(y_{t+1:t_2} \mid \lambda_{\tau+1:\tau_2}) \ge p(y_{t+1:t_2} \mid \Lambda)$. Express $y_{t_1:t_2}$ as a union of two consecutive subsegments $y_{t_1:t}$ and the complement $y_{t+1:t_2}$, and $\lambda_{\tau_1:\tau_2}$ as a union of the corresponding matching corpus subsegments $\lambda_{\tau_1:\tau}$ and $\lambda_{\tau+1:\tau_2}$. We can have the following likelihood ratio inequality:

$$\frac{\displaystyle\sum_{\lambda'} p(y_{t_1:t_2} \mid \lambda')}{p(y_{t_1:t_2} \mid \lambda_{\tau_1:\tau_2})} \le \frac{\displaystyle\sum_{\lambda'} p(y_{t_1:t} \mid \lambda')}{p(y_{t_1:t} \mid \lambda_{\tau_1:\tau})} \qquad (7)$$

This is because each competing segment likelihood $p(y_{t_1:t_2} \mid \lambda')$ factors, approximately, into the likelihoods of the two subsegments, and the ratio contributed by the second subsegment is bounded by one, based on the assumption that $y_{t+1:t_2}$ and $\lambda_{\tau+1:\tau_2}$ match. In a similar way, we can have an inequality concerning the likelihood ratio associated with the out-of-corpus segments:

$$\frac{p(y_{t_1:t_2} \mid \Lambda)}{p(y_{t_1:t_2} \mid \lambda_{\tau_1:\tau_2})} \le \frac{p(y_{t_1:t} \mid \Lambda)}{p(y_{t_1:t} \mid \lambda_{\tau_1:\tau})} \qquad (8)$$

Rewriting the posterior probability (3) as a function of the appropriate likelihood ratios, and applying the above two inequalities (7) and (8) to the expression, we can obtain an inequality concerning the posterior probability:

$$P(\lambda_{\tau_1:\tau_2} \mid y_{t_1:t_2}) \ge P(\lambda_{\tau_1:\tau} \mid y_{t_1:t}) \qquad (9)$$

Inequality (9) indicates that the posterior probability increases with the lengthening of the matching segments, when compared as whole units. Therefore, we can use the maximum values of the posterior probability to locate the longest matching segments between the test sentence and the corpus sentences, to be used for speech estimation.

Consider a noisy test sentence $Y = (y_1, y_2, \ldots, y_N)$. At each frame time $t$, we can find a longest test segment $y_{t-l+1:t}$ from $Y$ and the corresponding matching corpus segment, denoted as $\hat\lambda_t$, by maximizing the posterior probability. This can be expressed as

$$(\hat\lambda_t, \hat{l}_t) = \arg\max_{l} \, \max_{\lambda_{\tau_1:\tau_2}} P(\lambda_{\tau_1:\tau_2} \mid y_{t-l+1:t}) \qquad (10)$$

That is, $\hat\lambda_t$ is obtained by finding a most-probable corpus segment for each fixed-length test segment $y_{t-l+1:t}$, and then finding the maximum test segment length (i.e., $\hat{l}_t$) that results in the maximum posterior probability. Assuming that noise has a smaller effect on the correct recognition of the matches of longer speech segments, we will use the longest matching corpus segments $\hat\lambda_t$ found at all the frame times $t$ to form an estimate of the underlying target speech. Before discussing the details of forming the estimate, in Section IV we describe the combination of missing-feature based noise compensation into the above models and algorithms. The aim of this is to further improve the noise robustness for identifying the underlying matching speech segments using noisy speech.

IV. NOISE COMPENSATION

We consider noise compensation without assuming specific knowledge about the noise. We achieve this by combining multicondition model training and optimal feature selection. We call the method missing-feature based noise compensation, which has been studied previously within the HMM and GMM frameworks for robust speech and speaker recognition [35], [36]. In this section, we extend this method to the new LMS framework.

In the last section, we used the posterior probability to measure the similarity of a clean corpus segment $\lambda_{\tau_1:\tau_2}$ to a noisy test segment $y_{t_1:t_2}$, considering $\lambda_{\tau_1:\tau_2}$ as an estimate of the underlying target speech segment. We can make this measure more robust to the noise in $y_{t_1:t_2}$ assuming minimal prior knowledge about the noise. This can be achieved by combining two steps. In the first step, we simulate the noise in the test sentence by adding variable forms of noise to the clean corpus sentences. As such, we compare the noisy test sentence with the noisy corpus sentences to reduce the noise-caused mismatch. Denote by $\Lambda_b$ the GMM trained using the corpus data corrupted at noise condition $b$, which can be expressed as

$$p(y \mid \Lambda_b) = \sum_{k=1}^{K} w_k \, p(y \mid \lambda_{b,k}) \qquad (11)$$

where $\lambda_{b,k}$ represents the $k$th Gaussian component, which corresponds to the clean component $\lambda_k$ in (1) and is estimated using the corresponding frames corrupted at noise condition $b$. Based on $\Lambda_b$, the following time sequence, which is an extension of (2), can be used to represent a corpus sentence $X$ corrupted at noise condition $b$:

$$I_{X,b} = (\lambda_{b,i_1}, \lambda_{b,i_2}, \ldots, \lambda_{b,i_T}) \qquad (12)$$

where $I_{X,b}$ defines a sequence of maximum-likelihood Gaussian components in $\Lambda_b$ that can be used to model both the short-time spectra and the temporal dynamics of the noisy sentence. Using a similar notation, we can use $\lambda_{b,\tau_1:\tau_2}$ to represent a segment of consecutive frames taken from the noisy sentence, from frame $\tau_1$ to $\tau_2$. Now, instead of directly comparing the noisy test segment $y_{t_1:t_2}$ with the clean corpus segment $\lambda_{\tau_1:\tau_2}$, we compare $y_{t_1:t_2}$ with $\lambda_{\tau_1:\tau_2}$ through the noisy corpus segments $\lambda_{b,\tau_1:\tau_2}$, with variable noise conditions $b$. Assuming an equal prior probability for the different noise conditions, the corresponding posterior probability can be expressed as

$$P(\lambda_{\tau_1:\tau_2} \mid y_{t_1:t_2}) = \frac{\displaystyle\sum_{b} p(y_{t_1:t_2} \mid \lambda_{b,\tau_1:\tau_2})}{\displaystyle\sum_{\lambda'} \sum_{b} p(y_{t_1:t_2} \mid \lambda'_{b}) + p_{\mathrm{out}}(y_{t_1:t_2})} \qquad (13)$$

where, as in (4), the likelihood function $p(y_{t_1:t_2} \mid \lambda_{b,\tau_1:\tau_2})$ can be expressed as

$$p(y_{t_1:t_2} \mid \lambda_{b,\tau_1:\tau_2}) = \max_{\phi} \prod_{t=t_1}^{t_2} p(y_t \mid \lambda_{b,i_{\phi(t)}}) \qquad (14)$$

and, as in (5), we use the expression

$$p_{\mathrm{out}}(y_{t_1:t_2}) = \sum_{b} P(b) \prod_{t=t_1}^{t_2} \left[ \sum_{k=1}^{K} w_k \, p(y_t \mid \lambda_{b,k}) \right] \qquad (15)$$

to represent the out-of-corpus segments likelihood, where $P(b)$ is the prior probability of the noise condition $b$. As commonly termed in speech and speaker recognition research, (13) corresponds to a multicondition model. It improves upon the clean-condition model (3) by offering robustness to the variable noise conditions used in the training. It may include the clean-condition model by assuming one of the training conditions containing no noise. As the model assumes a global match between the test segment and the corpus segments, its robustness is limited to the given training conditions.

In the next step of the algorithm, the aim is to extend the robustness of the multicondition model beyond the training conditions without assuming extra information about the noise. This can be achieved by allowing for local mismatches between the test segment and the corpus segments, and by using optimal feature selection to remove the mismatching local features from the comparison. Assume that each frame can be represented by $Q$ independent frequency channels (i.e., subbands). Apply this frequency representation to the frames in the test segment and rewrite the segment as $y(S) = \{ y_{t,q} : t_1 \le t \le t_2,\ 1 \le q \le Q \}$, where $y_{t,q}$ is the local feature at time $t$ and subband $q$, and $S$ represents the full time-frequency space of the segment. Assume that at each training noise condition $b$, we can decompose the full test segment into two subsets: $y(S_b)$ and the complement $y(\bar{S}_b)$, where $y(S_b)$ represents the specific set of local features in $y(S)$, indexed by time-frequency subset $S_b$, that are corrupted at noise condition $b$ and are thus matched by the simulated training noise; $y(\bar{S}_b)$ represents the rest of the local features in $y(S)$ that are corrupted at different noise conditions from $b$ and are thus mismatched by the simulated training noise. Improved robustness can be achieved if we use the matched-condition subset $y(S_b)$ in place of the full set $y(S)$ to calculate the likelihood (i.e., the missing-feature theory; see, for example, [38]). In this way, robustness can be extended to noise conditions that are not fully represented by any of the given training conditions. The problem of identifying the longest matching corpus segment, (10), based on the multicondition model (13) and the optimal feature subset at each simulated noise condition, can be formulated as

$$(\hat\lambda_t, \hat{l}_t) = \arg\max_{l} \, \max_{\lambda_{\tau_1:\tau_2}} \, \max_{b,\, S_b} P\bigl(\lambda_{b,\tau_1:\tau_2} \mid y(S_b)\bigr) \qquad (16)$$

For each test segment of a given length, the expression seeks to find the most probable matching corpus segment by jointly maximizing the posterior probability over all corpus segments and all possible test feature subsets within each simulated noise condition. In (16), $P(\lambda_{b,\tau_1:\tau_2} \mid y(S_b))$ is the posterior probability of the noisy corpus segment given test feature subset $y(S_b)$, which can be calculated using (13) but with the subset $y(S_b)$ replacing the full test segment $y_{t-l+1:t}$.

In practice, we can add white noise at variable SNR levels to clean corpus sentences to simulate the noisy test speech. With optimal feature selection, white noise may be suitable for simulating arbitrary test noise by emphasizing certain frequency bands while deemphasizing others [35], [36]. A union-probability based fast algorithm can be used to solve (16) for the optimal feature sets. The union algorithm makes the following assumption about the likelihood function of the optimal feature set:

$$\sum_{S_b : |S_b| = n} p\bigl(y(S_b) \mid \lambda_{b,\tau_1:\tau_2}\bigr) \simeq \max_{S_b : |S_b| = n} p\bigl(y(S_b) \mid \lambda_{b,\tau_1:\tau_2}\bigr) \qquad (17)$$

where $|S_b|$ denotes the number of elements in set $S_b$. In other words, it is assumed that the sum over all $S_b$ for a given number of features $n$ is dominated by the optimal set of features. This reduces the problem of finding the exact set of optimal features to the problem of finding the number of optimal features, with a much lower computational complexity (for details, see [36], [37] for examples of using the algorithm for speech and face recognition). In reality, features are rarely completely usable or completely unusable, but somewhere in between. The union algorithm allows a softer probabilistic decision than forcing features to either be used or discarded. We have found that it is helpful to impose a constraint on the minimum size of the optimal feature subsets $S_b$. The constraint reflects a balance between retaining features for discrimination and ignoring mismatches for robustness. In our experiments, we forced each optimal set to contain at least half of the features in the full set.
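As a rough illustration of (10), the sketch below searches for the longest matching corpus segment ending at the last frame of a test excerpt. It is a simplification of the method described above, under assumed interfaces: it uses only the clean-condition posterior of (3)-(5) (no multicondition models, no subband feature selection, no union algorithm), replaces the DTW path of (4) with an equal-length alignment, and enumerates candidate segments exhaustively instead of pruning; the variable names and the tie-breaking rule are ours.

```python
import numpy as np
from scipy.special import logsumexp

def longest_matching_segment(test_loglik, log_weights, corpus_index_seqs, max_len=30):
    """Simplified Eq. (10): test_loglik is the (T, K) table of per-frame component
    log-likelihoods of the noisy test excerpt, corpus_index_seqs is a list of index
    sequences from Eq. (2), and the test segment is taken to end at the last frame."""
    T = test_loglik.shape[0]
    best = None                                  # (sentence, start, length, posterior)
    for length in range(1, min(max_len, T) + 1):
        seg = test_loglik[T - length:]           # (length, K)
        ti = float(np.sum(logsumexp(seg + log_weights, axis=1)))       # Eq. (5)
        td, cands = [], []
        for s, idx in enumerate(corpus_index_seqs):
            for start in range(len(idx) - length + 1):
                path = idx[start:start + length]
                td.append(float(seg[np.arange(length), path].sum()))   # Eq. (4)
                cands.append((s, start))
        if not td:
            continue
        td = np.array(td)
        post = np.exp(td - logsumexp(np.append(td, ti)))               # Eq. (3)
        j = int(np.argmax(post))
        # prefer the longest segment that attains the largest posterior
        if best is None or post[j] >= best[3]:
            best = (cands[j][0], cands[j][1], length, float(post[j]))
    return best
```

In the full system this search is repeated at every frame time, uses the multicondition posteriors of (13)-(16) with the union-based feature selection of (17), and prunes unlikely corpus segments as the segment length grows.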


V. RECONSTRUCTING TARGET SPEECH

A. Forming the Estimate

For noisy speech $Y = (y_1, \ldots, y_N)$, after finding the longest matching segments $\hat\lambda_t$ at all frame times $t$, we can use them to form an estimate of the underlying target speech. Let $s_t$ represent the target speech frame at time $t$, and $|s_t|$ be its magnitude spectrum. We can obtain an estimate for $|s_t|$ by taking all the matching segments that contain $y_t$ and averaging over the corresponding corpus frames. In the average, we use the posterior probability, obtained in (16), as a confidence score. We use the expression

$$|\hat{s}_t| = \frac{1}{Z_t} \sum_{u :\, t \in [u - \hat{l}_u + 1,\, u]} P\bigl(\hat\lambda_u \mid y_{u-\hat{l}_u+1:u}\bigr) \, \bar{s}_{\hat\phi_u(t)} \qquad (18)$$

where the sum is over all test segments (indexed by their end times $u$) that contain target frame $y_t$, $\hat\phi_u(t)$ is a corresponding corpus frame from the matching corpus segment $\hat\lambda_u$, $\bar{s}_{\hat\phi_u(t)}$ represents a prototype magnitude spectrum associated with that corpus frame, and $\hat\phi_u$ is the most-likely time path between $y_{u-\hat{l}_u+1:u}$ and $\hat\lambda_u$ as defined earlier. As shown in (18), each frame is estimated through identification of a longest matching segment, and each estimate is smoothed over successive longest matching segments. This improves robustness both to noise and to imperfect segment match. Frames within the same segment share a common confidence score, which is the posterior probability of the segment. In (18), $Z_t$ is a normalization term. In our experiments, the following expression is found to be suitable:

$$Z_t = \begin{cases} \displaystyle\sum_{u} P\bigl(\hat\lambda_u \mid y_{u-\hat{l}_u+1:u}\bigr) & \text{if } \displaystyle\sum_{u} P\bigl(\hat\lambda_u \mid y_{u-\hat{l}_u+1:u}\bigr) \ge 1 \\ 1 & \text{otherwise} \end{cases} \qquad (19)$$

The last condition prevents small posterior probabilities being scaled up to give a false emphasis. In our system, we use subband features, which are suitable for time-frequency feature selection, to identify the matching segments, and use the DFT magnitudes of the corpus frames as the prototype magnitude spectra to form the estimate. Given the index of a corpus frame, we can have two different approaches to calculate the prototype spectrum. First, it can be calculated directly on the specific corpus speech frame located by the index. Second, it can be calculated as an average DFT magnitude over all the corpus speech frames used to form the corresponding corpus Gaussian component. In the latter case, the prototype corresponds to the mean vector of the corpus Gaussian component in the DFT magnitude format. For convenience, we call (18) based entirely on the corpus data a codeword-based estimate, by viewing each corpus frame as a codeword and the corresponding corpus or GMM as a codebook.

Alternatively, we can form an estimate for the target speech by directly suppressing the noise in the noisy sentence. In this approach, we use the codeword-based estimate (18) to form an optimal filter. In our system, we use a Wiener filter of the form

$$H_t(\omega) = \frac{|\hat{s}_t(\omega)|^2}{|\hat{s}_t(\omega)|^2 + \hat{P}_{n,t}(\omega)} \qquad (20)$$

where $H_t$ represents the filter function at time $t$, and $\hat{P}_{n,t}$ is an estimate of the noise power spectral density, which can be obtained by using the noisy periodogram and the speech power spectral density estimate in a smoothed recursion:

$$\hat{P}_{n,t}(\omega) = \alpha \, \hat{P}_{n,t-1}(\omega) + (1-\alpha)\,\bigl(|Y_t(\omega)|^2 - |\hat{s}_t(\omega)|^2\bigr) \qquad (21)$$

where $\alpha$ is a smoothing constant and $|Y_t(\omega)|^2$ represents the noisy periodogram at time $t$. For convenience, we call the estimate based on (20) a filter-based estimate. In our experiments, both estimates, based on the codewords and on the filter, produced similar enhancement quality, while the filter-based estimate may be better in terms of keeping the original speech characteristics. Therefore, the filter-based estimates are used in the evaluation.

B. Iterative Estimation

As described above, given a noisy sentence $Y$, the underlying target speech can be reconstructed using the magnitude spectrum estimates $|\hat{s}_t|$. If we assume that adding the reconstructed speech back to the original noisy sentence will result in a sentence with a reduced noise level, then passing the new sentence into the above system to repeat the search for the longest matching segments and reconstruction could lead to an improved estimate for the target speech. This iteration is implemented in our system, as a complement to the above single-pass segmentation/reconstruction algorithm. We use the following expression to generate the new test sentence based on a previous test sentence and estimate:

$$|Y_t^{(i+1)}(\omega)| = \eta \, |Y_t(\omega)| + (1-\eta) \, |\hat{s}_t^{(i)}(\omega)| \qquad (22)$$

where $|Y_t(\omega)|$ represents the magnitude spectrum of test frame $y_t$, $i$ is the iteration index, and $\eta$ is a weighting constant. In the combined sentence, the true speech and the estimate reinforce each other. Therefore, using the combined sentence as a new input for the LMS system could result in a better estimate than the previous estimate, if not the same. The estimate converges when the two combined sentences $Y^{(i)}$ and $Y^{(i+1)}$ become effectively identical. We have tested this iterative algorithm for variable values of $\eta$ between 0.5 and 0.9 and found it converges with improved enhancement quality.
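The sketch below illustrates, in simplified form, how the estimates of this section could be assembled: the posterior-weighted averaging of (18)-(19), the filter of (20) with the smoothed noise PSD of (21), and the iterative re-estimation of (22). It is not the authors' code; the data layout, the alpha and eta values, and the non-negativity floor in the recursion are our assumptions.

```python
import numpy as np

def codeword_estimate(matches_per_frame):
    """Eqs. (18)-(19): posterior-weighted average of prototype magnitude spectra.
    matches_per_frame[t] is a non-empty list of (posterior, prototype_spectrum)
    pairs contributed by the longest matching segments that contain frame t."""
    rows = []
    for matches in matches_per_frame:
        total = sum(p for p, _ in matches)
        z = total if total >= 1.0 else 1.0        # do not scale small posteriors up
        rows.append(sum(p * spec for p, spec in matches) / z)
    return np.vstack(rows)                         # (T, n_bins)

def filter_estimate(noisy_mag, speech_mag_est, alpha=0.98):
    """Eqs. (20)-(21): Wiener filter driven by the codeword-based estimate, with a
    recursively smoothed noise PSD (alpha is an assumed setting)."""
    noise_psd = np.zeros(noisy_mag.shape[1])
    out = np.empty_like(noisy_mag)
    for t in range(noisy_mag.shape[0]):
        residual = np.maximum(noisy_mag[t] ** 2 - speech_mag_est[t] ** 2, 0.0)
        noise_psd = alpha * noise_psd + (1.0 - alpha) * residual            # Eq. (21)
        gain = speech_mag_est[t] ** 2 / (speech_mag_est[t] ** 2 + noise_psd + 1e-12)
        out[t] = gain * noisy_mag[t]                                        # Eq. (20)
    return out

def iterative_estimate(noisy_mag, enhance, eta=0.7, n_iter=8):
    """Eq. (22): feed a weighted mix of the original noisy magnitudes and the
    previous estimate back into the enhancement system (eta, n_iter assumed)."""
    speech_est = enhance(noisy_mag)
    for _ in range(n_iter - 1):
        mixed = eta * noisy_mag + (1.0 - eta) * speech_est
        speech_est = enhance(mixed)
    return speech_est
```

Here enhance stands for one full pass of the LMS search and filter-based reconstruction.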


VI. EXPERIMENTAL STUDIES

The TIMIT database, containing speech data sampled at 16 kHz, was used to evaluate the new LMS approach. In our experiments, the training set of the database, consisting of 3696 sentences from 462 speakers (326 male, 136 female), served as a corpus of free-text, free-speaker speech and was used to build the LMS models; the core test set of the database, consisting of 192 sentences from 24 speakers (16 male, 8 female), was used for testing. There is no overlap between the training speakers and test speakers, and there are no common sentence texts between the training sentences and test sentences. The training corpus was further divided into two gender-dependent sets: a male set containing 2608 sentences from 326 speakers, and a female set containing 1088 sentences from 136 speakers. Each set was modeled, separately, with a corpus GMM [i.e., (1)] with 4096 Gaussian components with diagonal covariance matrices. Within each set, each corpus sentence was represented using a sentence model [i.e., (2)], built on the corresponding corpus GMM. The male and female corpus speech were modeled separately to increase the models' resolution on speaker characteristics.

For noise compensation, we corrupted each clean corpus sentence at 25 different noise conditions derived from white noise. These include five different types of noise, white noise without filtering and white noise with low-pass filtering with a bandwidth of 1, 2, 3, and 4 kHz, respectively, and five different SNR levels for each noise type: 3, 6, 9, 12, and 15 dB. For each of the 25 simulated noise conditions, we formed a corresponding GMM [i.e., (11)] for each corpus set, and a corresponding sentence model [i.e., (12)] for each noisy corpus sentence, by cloning the frame-to-Gaussian-component alignments found in the clean-condition corpus model. These 25 noise-compensated corpus models, along with the clean-condition model, were used in the multicondition system (16), with a total of 26 different conditions, to simulate the noisy test sentences.

The speech was divided into frames of 20 ms with a frame period of 10 ms. To facilitate the selection of optimal time-frequency features [i.e., (16)], we further divided the full frequency band of each frame into subbands and calculated the features for each component subband in each frame. Subband-based frame features can be calculated by extending the fullband MFCC (Mel-scale cepstral coefficient) computation into subbands [39]. Specifically, we calculated a 10-subband, 20-component feature vector for each frame as follows. First, we used a 30-channel Mel-scale filterbank to obtain 30 log filterbank amplitudes. Then, we grouped the 30 log filterbank amplitudes uniformly into ten subbands, and performed a DCT within each subband. For each subband, containing three log filterbank amplitudes, we used two DCT coefficients. This gives a vector of ten subbands, where each subband contains two cepstral coefficients. These ten subbands, with the addition of their corresponding first-order delta components, form a 20-component vector, of an overall size of 40 coefficients, for each frame. Note that we independently modeled the static components and delta components. This allows the model [i.e., (16)] to select only the dynamic components for scoring. In our previous studies, we have found that this is useful for reducing the channel effect, which usually affects static features more adversely than dynamic features.
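A small sketch of the subband feature computation just described, assuming the 30 log Mel filterbank amplitudes per frame are already available (the filterbank, windowing, and the exact delta regression used by the authors are not reproduced here; the simple central difference below is our substitution).

```python
import numpy as np
from scipy.fftpack import dct

def subband_cepstra(log_fbank, n_subbands=10, coeffs_per_band=2):
    """Group 30 log Mel filterbank amplitudes into 10 subbands of 3 channels and
    keep 2 DCT coefficients per subband. log_fbank: (T, 30). Returns (T, 20)."""
    T, n_chan = log_fbank.shape
    per_band = n_chan // n_subbands                        # 3 channels per subband
    bands = log_fbank.reshape(T, n_subbands, per_band)
    coeffs = dct(bands, type=2, norm='ortho', axis=2)[:, :, :coeffs_per_band]
    return coeffs.reshape(T, n_subbands * coeffs_per_band)

def with_deltas(static):
    """Append first-order delta components, giving 40 coefficients per frame."""
    delta = np.zeros_like(static)
    delta[1:-1] = 0.5 * (static[2:] - static[:-2])         # simple central difference
    delta[0], delta[-1] = delta[1], delta[-2]
    return np.concatenate([static, delta], axis=1)         # (T, 40)
```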

Three types of noise and one type of crosstalk speech were used, respectively, as the test noise. The three noises were: 1) a babble noise (taken from NoiseX92), 2) a polyphonic musical ring, and 3) a pop song with mixed music and voice of a female singer. The crosstalk noise was a sentence from a speaker of a different gender. To simulate the crosstalk noise from male speakers, we concatenated all the male sentences in the core test set, with silences removed, into a long file of noise containing variable male speakers. Likewise, we concatenated all the female sentences in the core test set, with silences removed, into a long file of noise to simulate the crosstalk noise from variable female speakers. Examples of the spectra of these four types of noise are shown in Fig. 1. As can be seen, while the babble noise exhibited some characteristics of stationarity, the other three types of noise were highly nonstationary. The durations of these noise files range from about 1 min to about 6.5 min. For each noise type, we generated the noisy test sentences by taking the clean test sentences and adding consecutive portions of noise through the noise file in a continuous loop. Since the noise files were much longer than the individual speech sentences (of an average duration of about 3 s), each noisy test sentence effectively contained a different portion of the noise file. For each noise type, noisy test sentences were generated at three different SNR levels: 0, 5, and 10 dB, measured at the individual sentence level.

We compared the new LMS algorithm with four other algorithms, chosen from the four classes of conventional enhancement algorithms summarized in [40] and implemented in MATLAB. For each class, we chose a representative algorithm which produced the best overall objective evaluation results on our experimental data. These four algorithms were: 1) KLT, a subspace algorithm [41]; 2) Log MMSE with signal presence uncertainty, a model-based optimal estimation algorithm [42]; 3) MBand, a multiband spectral subtraction algorithm [43]; and 4) Wiener filtering with a priori SNR estimation, an optimal filtering algorithm [44]. While the new LMS algorithm assumed no information about the noise, the four conventional enhancement algorithms each used a classical noise estimation algorithm to track the underlying noise statistics (for details, see [40]). Although the LMS algorithm is capable of detecting matching segments between the test sentence and the corpus sentences with arbitrary segment lengths up to complete sentences, in the experiments we restricted the search for the matching segments to a maximum length of 30 frames, or 310 ms, to reduce the amount of computation and output delay. The search for the matching segments was accelerated by pruning those corpus segments which produce small posterior probabilities as the segment length increases (more details are given later).

A. Results of Speech Enhancement From Noise

This section compares the new LMS algorithm with the conventional KLT, Log MMSE, MBand, and Wiener filtering algorithms for enhancing the TIMIT sentences from the three types of noise: babble, musical ring, and pop song. The comparisons include three objective quality measures, segmental SNR, log-spectral distance, and phone identification accuracy, and informal subjective listening tests. Fig. 2 shows the segmental SNR measures and log-spectral distance measures produced by the five algorithms at each noise type/SNR condition, averaged over the 192 core test sentences. In our experiments, the babble noise served as an example of slow-varying noise, for which the conventional KLT and Log MMSE algorithms were among the best in terms of improving the segmental SNR, and the conventional MBand and Wiener filtering algorithms were among the best in terms of reducing the log spectral distance.
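For reference, the two signal-level measures used in this comparison can be computed as below. These are common textbook definitions; the exact variant used in the paper (frame length, energy thresholds, clamping limits) is not stated, so those settings are assumptions.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=320, lo=-10.0, hi=35.0):
    """Frame-averaged segmental SNR in dB (320 samples = 20 ms at 16 kHz);
    per-frame values are clamped to [lo, hi]."""
    n = (min(len(clean), len(enhanced)) // frame_len) * frame_len
    c = np.asarray(clean[:n], dtype=float).reshape(-1, frame_len)
    e = np.asarray(enhanced[:n], dtype=float).reshape(-1, frame_len)
    num = np.sum(c ** 2, axis=1)
    den = np.sum((c - e) ** 2, axis=1) + 1e-12
    snr = 10.0 * np.log10(num / den + 1e-12)
    return float(np.mean(np.clip(snr, lo, hi)))

def log_spectral_distance(clean_mag, enhanced_mag, eps=1e-12):
    """RMS distance (in dB) between log magnitude spectra, averaged over frames.
    clean_mag, enhanced_mag: (T, n_bins) magnitude spectra of time-aligned frames."""
    d = 20.0 * np.log10((clean_mag + eps) / (enhanced_mag + eps))
    return float(np.mean(np.sqrt(np.mean(d ** 2, axis=1))))
```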


Fig. 1. Examples of the noise types used in the experiments, showing the noise spectra over a period of about three seconds. (a) Babble. (b) Polyphonic musical ring. (c) Pop song. (d) Crosstalk speech.

Fig. 2. Segmental SNR measures (left) and log spectral distance measures (right) for noisy speech and enhanced speech, by the new LMS algorithm compared to the KLT, Log MMSE, MBand, and Wiener filtering (WF) algorithms, for three types of noise as a function of the input noisy sentence SNR. (a) Babble noise. (b) Musical ring noise. (c) Pop song noise.


TABLE I PHONE IDENTIFICATION ACCURACY (%) ON THE TIMIT CORE TEST SET FOR NOISY SPEECH AND ENHANCED SPEECH, BY THE NEW LMS ALGORITHM COMPARED TO THE KLT AND LOG MMSE ALGORITHMS, FOR THREE TYPES OF NOISE

For this type of slow-varying noise, the new LMS algorithm, which has no noise estimation, was competitive with all the conventional algorithms in both measures. For the other two types of faster-varying noise, musical ring and pop song, the new LMS algorithm performed consistently better than the conventional algorithms in all the SNR conditions, in both segmental SNR and log spectral distance measures. The conventional noise estimation algorithms may have had difficulties in predicting the statistics of these fast-varying noises.

A further objective measure, phone identification accuracy, was used to compare the LMS algorithm with the KLT and Log MMSE algorithms; the latter were the better performers among the four conventional algorithms. By convention, we identified 39 phones. We modeled each context-independent phone by using a three-state HMM with 24 diagonal Gaussian mixtures per state, with each frame represented by a 39-dimensional feature vector including the static, delta, and delta-delta MFCCs. Table I shows the identification accuracy rates over the 192 core test sentences for all the noise type/SNR conditions. The new LMS algorithm achieved significantly improved recognition accuracy for all the noise conditions, including the relatively slow-varying babble noise.

Two types of subjective listening tests were conducted to evaluate the new LMS algorithm. First, we evaluated the algorithm by adhering as much as possible to the ITU-T P.835 standard [45]. Given a sentence, the standard rates its quality in three separate aspects: 1) the quality of the speech in the sentence, 2) the intrusion of the noise in the sentence, and 3) the overall quality of the sentence [which is similar to the conventional mean opinion score (MOS)]. A five-category rating scale is used for each aspect of the evaluation. For speech/noise/overall quality, the corresponding scales are: 1) very distorted/very intrusive/bad, 2) fairly distorted/somewhat intrusive/poor, 3) somewhat distorted/noticeable but not intrusive/fair, 4) slightly distorted/slightly noticeable/good, and 5) not distorted/not noticeable/excellent.

From the core test set with 192 sentences from 24 speakers, we selected nine sentences from nine speakers (six male, three female), with one sentence for each noise type/SNR, for the

evaluation. Thus, there were a total of 18 test sentences (including nine noisy sentences and nine enhanced sentences). A group of 33 subjects (23 male, 10 female) participated in the test. As suggested in the standard, each subject took two test sessions separated by a short rest period. In each session, each of the 18 sentences was played three times for evaluating the three different aspects. In the first session, with three repetitions, each test sentence was evaluated first for the speech, then for the noise, then for the overall quality. In the second session, the order of evaluation for each sentence was changed to first the noise, then the speech, then the overall quality. The test sentences were presented in a random order in each test session. The scores were averaged across the two sessions. The results are presented in the left column of Fig. 3. As can be seen, compared to the noisy sentences, the LMS algorithm reduced the noise intrusion and improved the overall quality for all the noise conditions, in terms of higher ratings for both aspects. The LMS algorithm also improved the quality of speech for the musical ring noise at the 0 and 5 dB SNR conditions.

A further informal subjective evaluation, in the form of a preference test, was conducted to compare the LMS algorithm with the four conventional algorithms. For each noise type/SNR condition, six test sentence pairs were generated. In each pair, one sentence was processed by the new LMS algorithm and the other sentence was processed by a conventional algorithm (i.e., KLT, Log MMSE, MBand, or Wiener filtering), or was left without processing. The two sentences in a pair were presented in random order. The same group of 33 subjects participated in the test. Each subject was presented with a total of 54 test sentence pairs (6 pairs × 3 noise types × 3 SNR levels). The results are presented in the right column of Fig. 3, which shows the percentage of subjects preferring the sentences/algorithms (including the original noisy sentences without processing) in all the sentence pairs. In this preference test, the new LMS algorithm outperformed the other conventional algorithms, including the noisy sentences without processing, for all the noise type/SNR conditions.

B. Results of Speech Enhancement From Crosstalk Speech

With a little extension to the algorithm, the above LMS approach can be further used to retrieve a target speech sentence from crosstalk speech sentences spoken by different speakers. In this framework, we group the corpus sentences into different speaker classes, each class consisting of speakers with similar acoustic characteristics. The target sentence will be modeled by the corpus sentences within the target speaker class. Given a speech signal with mixed speech sentences, we identify the longest matching segments, used to reconstruct the target sentence, within the target speaker class by treating the interfering sentences from other speaker classes as noise. Thus, the expression (16), for identifying the longest matching corpus segment at time $t$, is modified as follows:

$$(\hat\lambda_t, \hat{l}_t) = \arg\max_{l} \, \max_{\lambda_{\tau_1:\tau_2} \in \Omega_c} \, \max_{b,\, S_b} P\bigl(\lambda_{b,\tau_1:\tau_2} \mid y(S_b)\bigr) \qquad (23)$$


Fig. 3. Subjective ratings for quality of speech (S), lesser intrusion of noise (N) and overall quality of sentence (O) for each SNR level (dB) for noisy speech and enhanced speech by the new LMS algorithm (left), and preference percentage comparing the LMS algorithm with other conventional algorithms (right), for three types of noise.

where the constraint that the matching corpus segment be chosen from the corpus sentences of the target speaker class ($\lambda_{\tau_1:\tau_2} \in \Omega_c$, with $\Omega_c$ denoting the set of corpus segments of speaker class $c$) is imposed. We performed a simple experiment, considering only two gender-based speaker classes, male and female, to demonstrate the feasibility of the above algorithm. As described earlier, we created male and female test sentences, each corrupted by continual speech from a different gender (i.e., the last type of noise in Fig. 1). The task is to retrieve the proper sentences from the corrupted sentences. The task is difficult not only because the noise is nonstationary, but also because the noise is a form of speech. For this type of speech noise, the conventional enhancement algorithms, relying on noise estimation based on voice activity detection, minimum statistics, or time-recursive averaging, do not appear to be applicable. Fig. 4 shows the performance of the LMS algorithm, comparing the objective measures and subjective measures between the noisy speech and the enhanced speech. Three sentences from three speakers (two male, one female) were chosen for the subjective evaluation by the same

group of 33 subjects. The objective scores were averaged over the entire 192 core test sentences. As indicated in Fig. 4, the LMS algorithm improved both the segmental SNR and log spectral distance measures for all the SNR conditions. For the subjective quality ratings, the algorithm received higher scores for reducing the noise intrusion at all the SNR levels, and considerable appreciation for improving the overall quality, particularly for the 0- and 5-dB SNR conditions. Though there was a slightly lower overall quality rating (by less than 0.06) for the LMS algorithm at the 10-dB SNR condition, the algorithm was preferred over the unprocessed noisy sentence for all the SNR conditions. This may be attributed to the fact that the algorithm reduced the noise intrusion by a considerable amount. Table II shows the objective phone identification results for the 192 test sentences with the crosstalk speech noise.

The above algorithm (23) may be further extended to separate mixed speech sentences spoken by different speakers without assuming identity of the individual speaker classes.


Fig. 4. Objective evaluation and subjective evaluation for noisy speech and enhanced speech by the new LMS algorithm, for crosstalk speech noise, as a function of the input noisy sentence SNR (dB).

TABLE II PHONE IDENTIFICATION ACCURACY (%) ON THE TIMIT CORE TEST SET FOR CROSSTALK SPEECH NOISE

Given a speech signal with mixed speech sentences, we identify the matching corpus segments and reconstruct a corresponding sentence within each potential speaker class. Then, we choose the reconstruction with the best score across the classes as an output sentence (this is, effectively, a joint estimation of the corpus segments and speaker class for the sentence with the highest SNR, by treating the other sentences as noise). Then, we remove the reconstruction from the original signal, and pass the modified signal back to the system to estimate the remaining sentences. This process continues until all sentences are identified. This approach estimates the individual sentences one at a time and thus scales linearly with the number of mixed sentences, while other approaches attempting joint decoding over all combinations scale exponentially (e.g., [46], [47]). This approach obtained good results in the Pascal Speech Separation Challenge for separating two mixed sentences [34].
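A compact sketch of the iterative separation procedure just described. The reconstruct_in_class function stands in for the full class-constrained LMS search of (23) and the filter-based reconstruction; its interface, and the use of simple waveform subtraction for removing a reconstructed sentence, are our assumptions.

```python
import numpy as np

def separate_sentences(mixed, speaker_classes, reconstruct_in_class, n_sentences):
    """Iteratively peel sentences off a single-channel mixture. reconstruct_in_class
    (signal, c) is assumed to return (reconstructed_waveform, score) for class c."""
    residual = np.asarray(mixed, dtype=float).copy()
    outputs = []
    for _ in range(n_sentences):
        # reconstruct one candidate sentence within every potential speaker class
        candidates = [reconstruct_in_class(residual, c) for c in speaker_classes]
        best_wave, _best_score = max(candidates, key=lambda cand: cand[1])
        outputs.append(best_wave)
        # remove the chosen reconstruction and re-run on what remains
        residual = residual - best_wave
    return outputs
```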

Fig. 5. Effect of iterative LMS estimation, showing improvement in (a) segmental SNR and (b) log spectral distance with the iteration, for babble, musical ring, pop song, and crosstalk speech noises at different SNR conditions.


Fig. 6. Comparing the LMS algorithm with three reduced algorithms: one without simulated noise compensation (-NC), one without optimal feature selection (-FS), and one based on single-frame segments rather than longest matching segments (1-frame seg), for musical ring noise at different SNR conditions.

C. LMS Algorithm Analysis

As discussed in Section V-B, given a noisy sentence, an iteration of the LMS estimation, using (22) to form the new input, may result in an improved estimate. Fig. 5 shows the effect of the iterative estimation, in terms of the segmental SNR and log spectral distance changes with the iteration, for all four types of noise averaged over the 192 test sentences. As can be seen, both measures improved with each iteration, and the improvement became smaller as the number of iterations increased. In our experiments, a maximum of eight iterations were performed for each of the noise conditions. The results were used for evaluation.

As described earlier, the new LMS algorithm consists of three core components: multicondition noise compensation with simulated noise, optimal feature selection to reduce the compensation mismatch, and estimation based on the longest matching segments to increase noise immunity. These three components combined offer robustness to nonstationary noise without assuming noise information. Using one test noise, musical ring, as an example, we studied the impact of each of the three components in terms of improving the objective measures. Fig. 6 shows a comparison of the LMS algorithm with three reduced versions, each having one of the components removed from the algorithm: noise compensation, feature selection, and estimation based on the longest matching segments. The first reduced algorithm has no noise compensation (noted as -NC); it therefore compares the noisy test sentence directly to clean corpus sentences to identify their longest matching segments, based on optimal time-frequency features that maximize the posterior probabilities. The second reduced algorithm has no feature selection (noted as -FS); it therefore compares the full set of time-frequency features between the noisy test sentence and the corpus sentences with simulated noise, to identify the longest matching segments. The last reduced algorithm (noted as 1-frame seg) has both noise compensation and feature selection as in the LMS algorithm, but does not search the longest matching segments for the estimation. Instead, for each noisy frame, it finds a single-frame corpus segment as an estimate for the frame through (16), by forcing $l = 1$ for all $t$. As shown in Fig. 6, all three components provide positive and independent contributions to the improved objective measures of the combined LMS algorithm, but estimation based on the longest matching segments appears to be the most important contribution. An informal listening test has also confirmed that using the longest matching segments is the most important of the three options in improving the perceptual quality. This example further confirms our intuition that maximizing the length of the speech segments to be identified may most effectively increase the noise immunity without assuming noise information.

Fig. 7 shows the histogram of the length of the longest matching segments found by the LMS algorithm for the test sentences corrupted by the musical ring noise. Two observations may be drawn from Fig. 7. First, the histograms for different SNR conditions show a similar pattern.

Fig. 7. Histogram of the length of the longest matching segments found by the LMS algorithm for test sentences corrupted by musical ring noise at different SNR conditions.

Fig. 8. Decrease in segmental SNR, increase in log spectral distance, and decrease in absolute phone identification accuracy of a reduced system based on a half-sized corpus, compared to the original system based on the full-size corpus, averaged over three types of noise (babble, musical ring, pop song), as a function of the input noisy sentence SNR.

Second, over 98% of the longest matching segments are three or more frames long, with an overall average segment length of about eleven frames, for all the SNR conditions.

A further experiment was conducted to evaluate the algorithm with the use of a smaller corpus to model the target speech. Specifically, we used a half-sized corpus to rebuild the LMS system. The reduced corpus was obtained from the original corpus by discarding 50% of the training sentences from each training speaker. Thus, with the reduced corpus, we had 1304 (instead of 2608) training sentences for male speakers, and 544 (instead of 1088) training sentences for female speakers.


Fig. 9. Outline of the proposed LMS algorithm.

Accordingly, in the reduced LMS system we used 2048 (instead of the original 4096) Gaussian components to model each gender based on the reduced data. Fig. 8 presents the results, showing the loss of performance (decrease in segmental SNR, increase in log spectral distance, and decrease in absolute phone identification accuracy) of the reduced system compared to the original system based on the full-size corpus. The measures are averaged over the three types of noise (babble, musical ring, pop song) and the 192 core test sentences. Fig. 8 shows that reducing the corpus size affected all three objective measures adversely, but not significantly. As the input SNR increases, the difference between the two systems becomes more apparent, especially in the segmental SNR and log spectral distance measures. This may be due to the fact that, when noise becomes less significant, the quality of the reconstructed speech will be more dependent on the quality of the corpus data used for reconstruction.

Finally, Fig. 9 outlines the computational procedure for the algorithm. Dealing with large corpora is made feasible in several ways. First, by mapping all the corpus sentences to GMMs, the complexity of comparing a test frame against all the corpus frames is scaled down to the calculation of the corpus Gaussians for the frame (Part A in Fig. 9). This mapping also reduced the memory usage (for example, we used less than 300 MB in our enhancement experiments with the TIMIT database as the corpus). Second, while the DTW algorithm can be used to match two segments (Part B), this match can be accelerated by using a linear mapping algorithm, assuming the existence of temporally identical speech segments in large speech corpora (we have tested the algorithm with the TIMIT database and obtained results almost as good as those reported in the paper). Finally, pruning was used to remove those unlikely corpus segments after comparing their first few frames with the test segment (Part C). This further significantly increased the execution speed of the algorithm without noticeable loss of performance. Combining these steps, the complexity of the algorithm scales linearly or less with the size of the corpus.
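The prefix-based pruning mentioned for Part C can be pictured as a simple beam over candidate corpus segments; the margin below is an invented setting, since the paper does not give the exact pruning criterion.

```python
import numpy as np

def prune_candidates(prefix_logliks, beam=20.0):
    """Keep only candidate corpus segments whose log-likelihood over the first few
    test frames is within `beam` of the best candidate; the survivors are then
    scored over the remaining frames of the segment."""
    scores = np.asarray(prefix_logliks, dtype=float)
    return np.flatnonzero(scores >= scores.max() - beam).tolist()
```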

VII. CONCLUDING REMARKS

This paper has presented a new approach to speech enhancement assuming a lack of prior information about the noise. This assumption applies to heavily nonstationary noise that can be difficult to predict with conventional noise estimation algorithms. The new approach, called LMS, aims to maximally extract two important features of speech, temporal dynamics and speaker characteristics, for retrieving speech from noise. It achieves this through recognition of long segments of the target speech as whole units. In the recognition, clean speech sentences taken from a corpus are used to provide examples, of both temporal dynamics and speaker-class characteristics, for the underlying target speech. Matching segments are identified between the noisy sentence and the corpus sentences. The estimate is formed by combining the longest matching segments found. The algorithm is made more robust to noise uncertainty by combining multicondition model training and optimal feature selection into the modeling and identification of the matching segments, based on the missing-feature theory. Examples of nonstationary noise used in the paper for experiments include song, music, and crosstalk speech. The new LMS algorithm has shown improved performance over conventional algorithms in both objective and subjective measures. For a relatively slow-varying babble noise, the new algorithm (without noise estimation) performed similarly to the conventional algorithms (with noise estimation) in SNR and spectral distortion measures, and outperformed the conventional algorithms in phone identification and subjective measures.

We are currently studying the possibility of combining the conventional noise estimation algorithms, which are effective in tracking slow-varying noise, into the LMS algorithm, to improve the algorithm for predictable noise while retaining robustness to unpredictable noise. The combination may be conducted in a model adaptation fashion: the noise statistics predicted by a conventional algorithm are used to update the corpus GMM so that the corpus sentence models, built on the GMM, become more tuned to the specific type of noise. This should help increase the accuracy of identifying the longest matching segments and thereby improve the reconstruction quality. Additionally, we are extending the current LMS algorithm for speech enhancement to give a speech separation algorithm, suitable for separating free-text mixed sentences spoken by different speakers. This may be achieved by introducing speaker classes into the corpus models, as discussed earlier.


ACKNOWLEDGMENT

The authors would like to thank the four anonymous reviewers for their helpful and constructive comments.


Ji Ming (M'97) received the B.Sc. degree from Sichuan University, Chengdu, China, in 1982, the M.Phil. degree from Changsha Institute of Technology, Changsha, China, in 1985, and the Ph.D. degree from the Beijing Institute of Technology, Beijing, China, in 1988, all in electronic engineering. He was Associate Professor with the Department of Electronic Engineering, Changsha Institute of Technology, from 1990 to 1993. Since 1993, he has been with Queen's University Belfast, Belfast, U.K., where he is currently a Professor in the School of Electronics, Electrical Engineering, and Computer Science. From 2005 to 2006, he was a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge. His research interests include speech and language processing, image processing, and pattern recognition.

Danny Crookes (M'98) was appointed to the Chair of Computer Engineering in 1993 at Queen's University Belfast, Belfast, U.K., and was Head of Computer Science from 1993 to 2002. He is currently Director of Research for Speech, Image, and Vision Systems at the Institute of Electronics, Communications, and Information Technology, Queen's University Belfast. His current research interests include the use of novel architectures (especially GPUs) for high-performance speech and image processing. He is currently involved in projects in automatic shoeprint recognition, speech separation and enhancement, and processing of 4-D confocal microscopy imagery. He has some 200 scientific papers in journals and international conferences.

Ramji Srinivasan (M'08) received the B.E. degree in electrical and electronics engineering from Madurai Kamaraj University, Madurai, India, the M.Tech. degree in process control and instrumentation from Regional Engineering College, Trichy, India, and the Ph.D. degree in electrical engineering from Anna University, Chennai, India, in 2000, 2002, and 2008, respectively. He started his research career with the Fluid Control Research Institute, Palakkad, India, as a Post Graduate Research Fellow in 2002, and later joined the National Institute of Ocean Technology, Chennai, as a Scientist, working there from 2003 to 2008. Since 2009, he has been a Research Fellow in the Institute of Electronics, Communications, and Information Technology, Queen's University Belfast, Belfast, U.K. His research interests include speech processing, underwater acoustic signal measurements and processing, and instrumentation system design and integration.
