ROBUST SPEAKER IDENTIFICATION USING
WAVELET TRANSFORM AND GAUSSIAN MIXTURE MODEL

JINOSH.T.G (96007106038)
HARINATH.S (96007106029)
SUBASH VISAKAN.K (96007106099)

Mr. S.SELVA NIDHYANANTHAN, AP/ECE


OBJECTIVE

• To maintain a database of voice tracks, capture a new vocal track, and recognize the track if it is already present in the database.

• To achieve high-efficiency speaker recognition using the Wavelet Transform and the Gaussian Mixture Model.

SPEAKER RECOGNITION
• Speaker recognition is concerned with extracting the
identity of the person speaking.
• It can be divided into two parts:
  a) Speaker verification
  b) Speaker identification

• Speaker verification determines whether a speech sample belongs to a specific speaker or not.
• In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice samples.

SPEAKER RECOGNITION continued…
• Speaker recognition can take either a text-dependent (TD) or a text-independent (TI) form.

• Unlike a text-dependent system, a text-independent system places no limitation on the keywords recorded in the system.

• Extracting speaker-dependent characteristics of the speech signal that can effectively distinguish one speaker from another is the key point affecting the performance of the system.

LITERATURE REVIEW
1. Donglai Zhu, Bin Ma, and Haizhou Li, "Speaker Verification With Feature-Space MAPLR Parameters," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, May 2010, pp. 505-515.

• Donglai Zhu characterizes a speaker by the difference between the speaker and a cohort of background speakers, in the form of feature-space maximum a posteriori linear regression (fMAPLR).

Limitation:
• Their experiments show that the performance is plagued by noise because of the simplified extraction method.

LITERATURE REVIEW continued…
2. G. M. White and R. B. Neely, "Speech recognition experiments with linear prediction, bandpass filtering and dynamic programming," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 24, 1976, pp. 183-188.

• Earlier, Linear Predictive Coding (LPC) was used because of its simplicity and effectiveness in speaker recognition.

Limitation:
• Their experiments show that the performance is plagued by noise because of the simplified extraction method.

LITERATURE REVIEW continued…
3. R. Vergin, D. O'Shaughnessy, and A. Farhat, "Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," IEEE Transactions on Speech and Audio Processing, Vol. 7, 1999, pp. 525-532.

• Another widely used parameter set is the Mel Frequency Cepstral Coefficients (MFCC), which are calculated with a filter-bank approach in which the filters have equal bandwidth with respect to the Mel scale frequencies. This is based on the fact that human perception of the frequency content of sounds does not follow a linear scale.

Limitation:
• Continuous speech requires increased computation time, and the Mel filter-bank design is also complex.
LITERATURE REVIEW continued…
4. S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, 1981, pp. 254-272.

• Furui used cepstral analysis and a mean-normalization technique to improve identification performance by minimizing intersession variability.

Limitation:
• However, the average spectra are susceptible to variations due to the manner of speech (e.g., loud or soft) and to noisy environments.

LITERATURE REVIEW continued…
5. K. Gopalan, T. R. Anderson, and E. J. Cupples, "A comparison of speaker identification results using features based on cepstrum and Fourier-Bessel expansion," IEEE Transactions on Speech and Audio Processing, Vol. 7, 1999, pp. 289-294.
6. F. Phan, M. T. Evangelia, and S. Sideman, "Speaker identification using neural networks and wavelets," IEEE Engineering in Medicine and Biology Magazine, Vol. 11, 1989, pp. 674-693.

• Gopalan proposed a compact representation of speech using Bessel functions because of the similarity between voiced speech and Bessel functions.
• Phan used the wavelet transform to divide speech into four octaves using Quadrature Mirror Filters (QMFs).

Limitation:
• Each utterance is constrained to within 0.5 s, so the speech samples are edited to truncate the trailing silence of each utterance.
PROPOSED METHOD

• The proposed speaker identification method derives LPCC features from the approximation coefficients and entropy features from the detail coefficients of the wavelet transform, and models them using a Gaussian Mixture Model.

STAGES IN THE PROPOSED ALGORITHM

STAGES

• The algorithm involves speech extraction based on time-frequency and multi-resolution analysis.
• The input signal is first preprocessed with a filter, and the filtered signal is interpolated.
• The interpolated signal is then segmented into frames and windowed. The wavelet transform is then applied to decompose the signal into lower- and higher-frequency channels, which are uncorrelated (a minimal preprocessing sketch follows this list).
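The front end described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the file name is hypothetical, and the pre-emphasis filter, the 2x interpolation factor, and the 25 ms / 10 ms Hamming-window framing are assumed choices the slides do not specify.

```python
# Minimal preprocessing sketch: load, filter, normalize, interpolate, frame, window.
# Assumptions: 16 kHz mono WAV input (per the database slide); everything else illustrative.
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample

fs, x = wavfile.read("speaker01.wav")            # hypothetical 16 kHz mono recording
x = x.astype(np.float64)
x = np.append(x[0], x[1:] - 0.97 * x[:-1])       # simple pre-emphasis filter (assumed choice)
x /= np.max(np.abs(x)) + 1e-12                   # amplitude normalization

fs_i = 2 * fs                                    # interpolate to twice the rate (illustrative)
x_i = resample(x, 2 * len(x))

frame_len = int(0.025 * fs_i)                    # 25 ms frames
hop = int(0.010 * fs_i)                          # 10 ms hop
window = np.hamming(frame_len)

frames = np.array([window * x_i[s:s + frame_len]
                   for s in range(0, len(x_i) - frame_len + 1, hop)])
print(frames.shape)                              # (num_frames, frame_len)
```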

STAGES continued…
• To capture the characteristics of an individual speaker, the LPCC of the lower-frequency (approximation) channel and the entropy of the higher-frequency (detail) channels are calculated. The related coefficients are extracted using wavelet decomposition. The speaker recognition system is modeled using a Gaussian Mixture Model, and an unknown speaker is tested against the known speakers in the database.
• Finally, the speech database is used to evaluate the proposed extraction algorithm for text-independent speaker identification.

DATABASE DEVELOPMENT

• Recording platform  : GoldWave 5.48
• Sampling frequency  : 16 kHz
• Recording length    : 2 minutes
• Recording mode      : mono
• Encoding            : Pulse Code Modulation
• Microphone used     : condenser type
• Database size       : 50 samples
• Speech segments     : 6000 segments
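The recording specification above can be checked programmatically; this is a hypothetical validation snippet (the file name is assumed), not part of the original work.

```python
# Verify that a recording matches the database specification above:
# 16 kHz sampling rate, mono channel, PCM WAV encoding. File name is hypothetical.
from scipy.io import wavfile

fs, data = wavfile.read("speaker01.wav")
assert fs == 16000, "expected a 16 kHz sampling rate"
assert data.ndim == 1, "expected a mono recording"
print(f"duration: {len(data) / fs:.1f} s")       # recording length should be about 2 minutes
```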

RECORDING SPEECH SIGNAL FROM DIFFERENT
SPEAKERS

[Figure: waveform plots of the original, filtered, interpolated, framed, and windowed signals (amplitude vs. length of the signal).]
WAVELET TRANSFORM
• The Wavelet Transform (WT) is localized in both time and frequency.
• The basis functions of the WT are small waves located at different times.
• They are obtained by scaling and translation of a scaling function and a wavelet function.
• In addition, the WT provides a multi-resolution system.
• Multi-resolution analysis is useful in speaker recognition applications.

WAVELET TRANSFORM continued…
• We can construct the discrete WT via iterated (octave-band) filter banks.
• The analysis section repeatedly splits the signal into a lowpass (approximation) branch and a highpass (detail) branch, each followed by downsampling by two; a minimal sketch is given below.
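A minimal sketch of this analysis bank, assuming PyWavelets (pywt) as the implementation; the db2 wavelet and 3 decomposition levels follow the decomposition shown later in the slides.

```python
# Iterated octave-band analysis: at each level, split the current approximation
# into a lowpass (approximation) and a highpass (detail) channel, downsampled by 2.
import numpy as np
import pywt

def analysis_bank(frame, wavelet="db2", levels=3):
    details = []
    approx = np.asarray(frame, dtype=float)
    for _ in range(levels):
        approx, detail = pywt.dwt(approx, wavelet)   # one filter-bank stage
        details.append(detail)
    return approx, details                           # cA3 and [cD1, cD2, cD3]

# The same decomposition in one call:
# coeffs = pywt.wavedec(frame, "db2", level=3)       # -> [cA3, cD3, cD2, cD1]
```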

AFTER APPLYING THE DISCRETE WAVELET TRANSFORM (db2), 3 LEVELS

[Figure: wavelet coefficients at stages 1, 2, and 3 (amplitude vs. coefficient index).]
Entropy Calculation

• All the entropy values within the detail channels are calculated to construct a more compact feature vector. The entropy of the wavelet coefficients is calculated by

      Entropy = -\sum_{i=0}^{m-1} p_i \log_2(p_i)

  where

      p_i = \frac{X(i)}{\sum_{i=0}^{m-1} X(i)}

  and X(i) denotes the i-th wavelet coefficient of the channel (a minimal sketch of this calculation follows).
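A minimal sketch of this entropy feature for one detail channel. Taking coefficient magnitudes so the weights p_i are non-negative is an assumption; the slide's formula writes X(i) directly.

```python
import numpy as np

def detail_entropy(detail_coeffs):
    """Entropy = -sum_i p_i * log2(p_i), with p_i = |X(i)| / sum_i |X(i)| (magnitude is an assumption)."""
    mag = np.abs(np.asarray(detail_coeffs, dtype=float))
    p = mag / (mag.sum() + 1e-12)       # normalized coefficient weights
    p = p[p > 0]                        # skip zero-weight coefficients (log2(0) undefined)
    return float(-np.sum(p * np.log2(p)))
```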

LPCC

• The Linear Predictive Cepstral Coefficients (LPCC) of the approximation channel are calculated to capture the characteristics of the vocal tract.
• Calculating the LPCC involves two steps (step 1 is sketched below):
  1) Finding the Linear Predictive Coding (LPC) coefficients.
  2) Converting the LPC coefficients into cepstral coefficients.
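A minimal sketch of step 1, assuming the autocorrelation method with the Levinson-Durbin recursion; the slides do not state which LPC estimation method was used, and the model order of 12 is an illustrative choice.

```python
import numpy as np

def lpc_autocorr(frame, order=12):
    """LPC coefficients a(1)..a(p) of a windowed frame via the autocorrelation
    method and the Levinson-Durbin recursion. Convention: the predictor is
    x_hat[n] = sum_m a(m) * x[n - m]."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)              # error-filter polynomial [1, -a(1), ..., -a(p)]
    a[0] = 1.0
    err = r[0] + 1e-12                   # small regularization for silent frames
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:]                        # a(1)..a(p) in predictor convention
```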

LPCC Continued…

• The cepstral coefficients can be computed from the LPC coefficients by a recursion; a sketch of the standard LPC-to-cepstrum recursion is given below.
• Here a(m), m = 1, 2, ..., p, are the LPC coefficients and p is the model order.
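A minimal sketch of step 2, under the assumption that the intended recursion is the commonly used form c_n = a(n) + \sum_{k=1}^{n-1} (k/n) c_k a(n-k), with a(n) = 0 for n > p. The slide's exact formula image is not available, so this should be read as the standard recursion rather than the authors' exact expression.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=12):
    """Cepstral coefficients c_1..c_N from LPC coefficients a(1)..a(p) via
    c_n = a(n) + sum_{k=1}^{n-1} (k/n) * c_k * a(n-k), taking a(n) = 0 for n > p."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        c_n = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if 1 <= n - k <= p:
                c_n += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = c_n
    return c

# Example use with the earlier LPC sketch (approx_frame is an assumed variable):
# lpcc = lpc_to_cepstrum(lpc_autocorr(approx_frame, order=12), n_ceps=12)
```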

FINAL FEATURES

• The final features are calculated by the formula given below.

MODELING

• Gaussian Mixture Models (GMMs) are among the most statistically mature methods for clustering. A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities.
• GMMs are formed by combining multivariate normal density components. The GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or the K-Means algorithm (a training and identification sketch follows).
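A minimal sketch of GMM enrolment and identification, assuming scikit-learn's GaussianMixture as the implementation; the component count and covariance type are illustrative choices, not values taken from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, n_components=16):
    """features: (num_frames, feature_dim) array of per-frame feature vectors."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",   # diagonal covariances (illustrative)
                          max_iter=200)
    return gmm.fit(features)                        # EM estimation of weights, means, variances

def identify_speaker(features, speaker_models):
    """Pick the enrolled speaker whose GMM gives the highest average log-likelihood.
    speaker_models: assumed dict of name -> fitted GaussianMixture."""
    scores = {name: gmm.score(features) for name, gmm in speaker_models.items()}
    return max(scores, key=scores.get)
```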

MODELING Continued…
• We use the K-Means algorithm for pattern recognition in the modeling stage. It is one of the simplest unsupervised learning algorithms for the well-known clustering problem: given a set of n-dimensional vectors, the vectors must be classified into K categories.
• The centroid vector of each group represents the identity of its category. If a new vector is to be classified, the distances between the new vector and all the centroids are computed, and the centroid with the smallest distance identifies the group. The centroids of all the groups are obtained using the K-Means algorithm, and the mixture weights, means, and variances are then calculated (see the sketch below).
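A minimal sketch of this step, assuming scikit-learn's KMeans; the value of K is illustrative, and the nearest-centroid rule follows the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_centroids(vectors, k=16):
    """Cluster the feature vectors into K categories and return the centroids."""
    km = KMeans(n_clusters=k, n_init=10).fit(vectors)
    return km.cluster_centers_

def classify(vector, centroids):
    """Assign a new vector to the category of the closest centroid."""
    dists = np.linalg.norm(centroids - vector, axis=1)
    return int(np.argmin(dists))
```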

[Figure: centroid vectors formed for each group using the K-Means algorithm.]

MODELING Continued…

SYSTEM EVALUATION
NUMBER OF TRIALS                  : 50
NUMBER OF CORRECT IDENTIFICATIONS : 33
NUMBER OF WRONG IDENTIFICATIONS   : 17

EFFICIENCY: 33/50 = 66%

SOCIAL IMPACT

• An attendance system can be maintained with a prerecorded database of the students.
• It can be used for military purposes to identify the exact speaker.
• It can also be used in security applications in homes and banks.

CONCLUSION
• It is expected that the proposed method will work successfully in recognizing the speaker even in low-SNR environments corrupted by Gaussian white noise.

REFERENCES
[1] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, Vol. 3, 1995, pp. 72-83.
[2] C. M. Alamo, F. J. C. Gil, C. T. Munilla, and L. H. Gomez, "Discriminative training of GMM for speaker identification," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1996, pp. 89-92.
[3] C. S. Burrus, R. A. Gopinath, and H. Guo, Introduction to Wavelets and Wavelet Transforms, Prentice Hall, New Jersey, 1997.
[4] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Communications on Pure and Applied Mathematics, Vol. 41, 1988, pp. 909-996.
[5] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, New York, 1993.
[6] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, 1981, pp. 254-272.

[7] H. C. Wang, "MAT - A project to collect Mandarin speech data through telephone networks in Taiwan," Computational Linguistics and Chinese Language Processing, Vol. 2, 1997, pp. 73-90.
[8] S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, 1989, pp. 674-693.
[9] C. Miyajima, Y. Hattori, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Text-independent speaker identification using Gaussian mixture models based on multi-space probability distribution," IEEE Transactions on Information and Systems, Vol. E84-B, 2001, pp. 847-855.
[10] B. L. Pellom and J. H. L. Hansen, "An effective scoring algorithm for Gaussian mixture model based speaker identification," IEEE Signal Processing Letters, Vol. 5, 1998, pp. 281-284.
[11] M. J. Carey, E. S. Parris, et al., "Robust prosodic features for speaker identification," in Proceedings of ICSLP '96, 1996.
[12] T. Wark and S. Sridharan, "Adaptive fusion of speech and lip information for robust speaker identification," Digital Signal Processing, Vol. 11, 2001, pp. 169-186.
[13] H. Gish and M. Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, Vol. 11, pp. 18-32.
[14] D. J. Mashao and M. Skosan, "Combining classifier decisions for robust speaker identification," Pattern Recognition, Vol. 39, Issue 1, January 2006, pp. 147-155.
[15] S. R. Cloude, "An entropy based classification scheme for land applications of polarimetric SAR," IEEE Transactions on Geoscience and Remote Sensing, 2002, pp. 1100-1102.
[16] Young Han Lee and Hong Kook Kim, "Entropy coding of compressed feature parameters for distributed speech recognition," Speech Communication, Vol. 52, Issue 5, May 2010, pp. 405-412.
[17] D. S. Kim and S. Y. Lee, "Auditory processing of speech signals for robust speech recognition in real-world noisy environments," IEEE Transactions on Speech and Audio Processing, 2002, pp. 55-69.
[18] E. Wong, "Comparison of linear prediction cepstrum coefficients and mel-frequency cepstrum coefficients for language identification," IEEE Multimedia, Video and Speech Processing, 2002, pp. 95-98.
[19] K. Y. Park, "Narrowband to wideband conversion of speech using GMM based transformation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, pp. 1843-1846, Vol. 3.
[20] I. W. Selesnick, "The double-density dual-tree DWT," IEEE Transactions on Signal Processing, 2004, pp. 1304-1314.

Thank You!!!
