ROBUST SPEAKER IDENTIFICATION USING
WAVELET TRANSFORM AND GAUSSIAN MIXTURE MODEL

JINOSH.T.G (96007106038)
HARINATH.S (96007106029)
SUBASH VISAKAN.K (96007106099)

Mr. S.SELVA NIDHYANANTHAN, AP/ECE


OBJECTIVE

• To maintain a database of voice tracks, capture a new vocal track, and recognize the track if it is already present in the database.

• To achieve high-efficiency speaker recognition using the Wavelet Transform and the Gaussian Mixture Model.

SPEAKER RECOGNITION
• Speaker recognition is concerned with extracting the
identity of the person speaking.
• It can be divided into two parts:
  a) Speaker verification
  b) Speaker identification

• Speaker verification determines whether a speech sample belongs to a specific speaker or not.
• In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice samples.

SPEAKER RECOGNITION continued…
• Speaker recognition can take either a text-dependent (TD) or a text-independent (TI) form.

• Unlike a text-dependent system, a text-independent system places no limitation on the keywords recorded in the system.

• Extracting speaker-dependent characteristics of the speech signal that can effectively distinguish one speaker from another is the key point affecting the performance of the system.

LITERATURE REVIEW
1. Donglai Zhu, Bin Ma, and Haizhou Li, "Speaker Verification With Feature-Space MAPLR Parameters," IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, May 2010, pp. 505-515.

• Donglai Zhu characterizes a speaker by the difference between the speaker and a cohort of background speakers, in the form of feature-space maximum a posteriori linear regression (fMAPLR).

Limitation:
• Their experiments show that the performance is plagued by noise because of the simplified extraction method.

LITERATURE REVIEW continued…
2. G. M. White and R. B. Neely, "Speech recognition experiments with linear prediction, bandpass filtering and dynamic programming," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 24, 1976, pp. 183-188.

• Earlier, Linear Predictive Coding (LPC) was used because of its simplicity and effectiveness in speaker recognition.

Limitation:
• Their experiments show that the performance is plagued by noise because of the simplified extraction method.

LITERATURE REVIEW continued…
3. R. Vergin, D. O'Shaughnessy, and A. Farhat, "Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition," IEEE Transactions on Speech and Audio Processing, Vol. 7, 1999, pp. 525-532.

• Another widely used parameter set is the Mel Frequency Cepstral Coefficients (MFCC), which are calculated with a filter-bank approach in which the filters have equal bandwidth with respect to the Mel scale frequencies. This is based on the fact that human perception of the frequency content of sounds does not follow a linear scale.

Limitation:
• Continuous speech requires increased computation time, and the Mel filter-bank design is also complex.
LITERATURE REVIEW continued…
4. S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, 1981, pp. 254-272.

• Furui used cepstral analysis and a mean-normalization technique to improve identification performance by minimizing intersession variability.

Limitation:
• However, the average spectra are susceptible to variations due to the manner of speech (e.g., loud or soft) and to noisy environments.

LITERATURE REVIEW continued…
5. K. Gopalan, T. R. Anderson, and E. J. Cupples, "A comparison of speaker identification results using features based on cepstrum and Fourier-Bessel expansion," IEEE Transactions on Speech and Audio Processing, Vol. 7, 1999, pp. 289-294.
6. F. Phan, M. T. Evangelia, and S. Sideman, "Speaker identification using neural networks and wavelets," IEEE Engineering in Medicine and Biology Magazine, Vol. 11, 1989, pp. 674-693.

• Gopalan proposed a compact representation of speech using Bessel functions because of the similarity between voiced speech and Bessel functions.
• Phan used the wavelet transform to divide speech into four octaves using Quadrature Mirror Filters (QMFs).

Limitation:
• Each utterance is constrained to within 0.5 s, so the speech samples are edited to truncate the trailing silence of each utterance.
PROPOSED METHOD

• The proposed speaker identification method derives LPCC features from the approximation coefficients and entropy features from the detail coefficients of the wavelet transform, and models them using a Gaussian Mixture Model.

STAGES IN THE PROPOSED ALGORITHM

STAGES

• The algorithm involves speech extraction based on time-frequency and multi-resolution analysis.
• The input signal is first preprocessed with a filter, and the filtered signal is interpolated.
• The interpolated signal is then segmented into frames and windowed. The wavelet transform is then applied to decompose the signal into lower- and higher-frequency channels, which are uncorrelated (a minimal preprocessing sketch follows this list).
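The front end described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the file name is hypothetical, and the pre-emphasis filter, the 2x interpolation factor, and the 25 ms / 10 ms Hamming-window framing are assumed choices the slides do not specify.

```python
# Minimal preprocessing sketch: load, filter, normalize, interpolate, frame, window.
# Assumptions: 16 kHz mono WAV input (per the database slide); everything else illustrative.
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample

fs, x = wavfile.read("speaker01.wav")            # hypothetical 16 kHz mono recording
x = x.astype(np.float64)
x = np.append(x[0], x[1:] - 0.97 * x[:-1])       # simple pre-emphasis filter (assumed choice)
x /= np.max(np.abs(x)) + 1e-12                   # amplitude normalization

fs_i = 2 * fs                                    # interpolate to twice the rate (illustrative)
x_i = resample(x, 2 * len(x))

frame_len = int(0.025 * fs_i)                    # 25 ms frames
hop = int(0.010 * fs_i)                          # 10 ms hop
window = np.hamming(frame_len)

frames = np.array([window * x_i[s:s + frame_len]
                   for s in range(0, len(x_i) - frame_len + 1, hop)])
print(frames.shape)                              # (num_frames, frame_len)
```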

STAGES continued…
• To capture the characteristics of an individual speaker, the LPCC of the lower-frequency (approximation) channel and the entropy of the higher-frequency (detail) channels are calculated. The related coefficients are extracted using wavelet decomposition. The speaker recognition system is modeled using a Gaussian Mixture Model, and an unknown speaker is tested against the known speakers in the database.
• Finally, the speech database is used to evaluate the proposed extraction algorithm for text-independent speaker identification.

DATABASE DEVELOPMENT

• Recording platform  : GoldWave 5.48
• Sampling frequency  : 16 kHz
• Recording length    : 2 minutes
• Recording mode      : mono
• Encoding            : Pulse Code Modulation
• Microphone used     : condenser type
• Database size       : 50 samples
• Speech segments     : 6000 segments
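The recording specification above can be checked programmatically; this is a hypothetical validation snippet (the file name is assumed), not part of the original work.

```python
# Verify that a recording matches the database specification above:
# 16 kHz sampling rate, mono channel, PCM WAV encoding. File name is hypothetical.
from scipy.io import wavfile

fs, data = wavfile.read("speaker01.wav")
assert fs == 16000, "expected a 16 kHz sampling rate"
assert data.ndim == 1, "expected a mono recording"
print(f"duration: {len(data) / fs:.1f} s")       # recording length should be about 2 minutes
```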

RECORDING SPEECH SIGNAL FROM DIFFERENT
SPEAKERS

[Figure: waveform plots of the original, filtered, interpolated, framed, and windowed signals (amplitude vs. length of the signal).]
WAVELET TRANSFORM
• The Wavelet Transform (WT) is localized in both time and frequency.
• The basis functions of the WT are small waves located at different times.
• They are obtained by scaling and translation of a scaling function and a wavelet function.
• In addition, the WT provides a multi-resolution system.
• Multi-resolution analysis is useful in speaker recognition applications.

WAVELET TRANSFORM continued…
• We can construct the discrete WT via iterated (octave-band) filter banks.
• The analysis section repeatedly splits the signal into a lowpass (approximation) branch and a highpass (detail) branch, each followed by downsampling by two; a minimal sketch is given below.
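A minimal sketch of this analysis bank, assuming PyWavelets (pywt) as the implementation; the db2 wavelet and 3 decomposition levels follow the decomposition shown later in the slides.

```python
# Iterated octave-band analysis: at each level, split the current approximation
# into a lowpass (approximation) and a highpass (detail) channel, downsampled by 2.
import numpy as np
import pywt

def analysis_bank(frame, wavelet="db2", levels=3):
    details = []
    approx = np.asarray(frame, dtype=float)
    for _ in range(levels):
        approx, detail = pywt.dwt(approx, wavelet)   # one filter-bank stage
        details.append(detail)
    return approx, details                           # cA3 and [cD1, cD2, cD3]

# The same decomposition in one call:
# coeffs = pywt.wavedec(frame, "db2", level=3)       # -> [cA3, cD3, cD2, cD1]
```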

AFTER APPLYING THE DISCRETE WAVELET TRANSFORM (db2), 3 LEVELS

[Figure: wavelet coefficients at stages 1, 2, and 3 (amplitude vs. coefficient index).]
Entropy Calculation

• All the entropy values within the detail channels are calculated to construct a more compact feature vector. The entropy of the wavelet coefficients is calculated by

      Entropy = -\sum_{i=0}^{m-1} p_i \log_2(p_i)

  where

      p_i = \frac{X(i)}{\sum_{i=0}^{m-1} X(i)}

  and X(i) denotes the i-th wavelet coefficient of the channel (a minimal sketch of this calculation follows).
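A minimal sketch of this entropy feature for one detail channel. Taking coefficient magnitudes so the weights p_i are non-negative is an assumption; the slide's formula writes X(i) directly.

```python
import numpy as np

def detail_entropy(detail_coeffs):
    """Entropy = -sum_i p_i * log2(p_i), with p_i = |X(i)| / sum_i |X(i)| (magnitude is an assumption)."""
    mag = np.abs(np.asarray(detail_coeffs, dtype=float))
    p = mag / (mag.sum() + 1e-12)       # normalized coefficient weights
    p = p[p > 0]                        # skip zero-weight coefficients (log2(0) undefined)
    return float(-np.sum(p * np.log2(p)))
```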

LPCC

• The Linear Predictive Cepstral Coefficients (LPCC) of the approximation channel are calculated to capture the characteristics of the vocal tract.
• Calculating the LPCC involves two steps (step 1 is sketched below):
  1) Finding the Linear Predictive Coding (LPC) coefficients.
  2) Converting the LPC coefficients into cepstral coefficients.
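A minimal sketch of step 1, assuming the autocorrelation method with the Levinson-Durbin recursion; the slides do not state which LPC estimation method was used, and the model order of 12 is an illustrative choice.

```python
import numpy as np

def lpc_autocorr(frame, order=12):
    """LPC coefficients a(1)..a(p) of a windowed frame via the autocorrelation
    method and the Levinson-Durbin recursion. Convention: the predictor is
    x_hat[n] = sum_m a(m) * x[n - m]."""
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)              # error-filter polynomial [1, -a(1), ..., -a(p)]
    a[0] = 1.0
    err = r[0] + 1e-12                   # small regularization for silent frames
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return -a[1:]                        # a(1)..a(p) in predictor convention
```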

LPCC Continued…

• The cepstral coefficients can be computed from the LPC coefficients by a recursion; a sketch of the standard LPC-to-cepstrum recursion is given below.
• Here a(m), m = 1, 2, ..., p, are the LPC coefficients and p is the model order.
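A minimal sketch of step 2, under the assumption that the intended recursion is the commonly used form c_n = a(n) + \sum_{k=1}^{n-1} (k/n) c_k a(n-k), with a(n) = 0 for n > p. The slide's exact formula image is not available, so this should be read as the standard recursion rather than the authors' exact expression.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps=12):
    """Cepstral coefficients c_1..c_N from LPC coefficients a(1)..a(p) via
    c_n = a(n) + sum_{k=1}^{n-1} (k/n) * c_k * a(n-k), taking a(n) = 0 for n > p."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        c_n = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if 1 <= n - k <= p:
                c_n += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = c_n
    return c

# Example use with the earlier LPC sketch (approx_frame is an assumed variable):
# lpcc = lpc_to_cepstrum(lpc_autocorr(approx_frame, order=12), n_ceps=12)
```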

FINAL FEATURES

• The final features are calculated by the formula given below.

MODELING

• Gaussian Mixture Models (GMMs) are among the most statistically mature methods for clustering. A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities.
• GMMs are formed by combining multivariate normal density components. The GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or the K-Means algorithm (a training and identification sketch follows).
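A minimal sketch of GMM enrolment and identification, assuming scikit-learn's GaussianMixture as the implementation; the component count and covariance type are illustrative choices, not values taken from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, n_components=16):
    """features: (num_frames, feature_dim) array of per-frame feature vectors."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",   # diagonal covariances (illustrative)
                          max_iter=200)
    return gmm.fit(features)                        # EM estimation of weights, means, variances

def identify_speaker(features, speaker_models):
    """Pick the enrolled speaker whose GMM gives the highest average log-likelihood.
    speaker_models: assumed dict of name -> fitted GaussianMixture."""
    scores = {name: gmm.score(features) for name, gmm in speaker_models.items()}
    return max(scores, key=scores.get)
```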

MODELING Continued…
• We use the K-Means algorithm for pattern recognition in the modeling stage. It is one of the simplest unsupervised learning algorithms for the well-known clustering problem: given a set of n-dimensional vectors, the vectors must be classified into K categories.
• The centroid vector of each group represents the identity of its category. If a new vector is to be classified, the distances between the new vector and all the centroids are computed, and the centroid with the smallest distance identifies the group. The centroids of all the groups are obtained using the K-Means algorithm, and the mixture weights, means, and variances are then calculated (see the sketch below).
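A minimal sketch of this step, assuming scikit-learn's KMeans; the value of K is illustrative, and the nearest-centroid rule follows the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_centroids(vectors, k=16):
    """Cluster the feature vectors into K categories and return the centroids."""
    km = KMeans(n_clusters=k, n_init=10).fit(vectors)
    return km.cluster_centers_

def classify(vector, centroids):
    """Assign a new vector to the category of the closest centroid."""
    dists = np.linalg.norm(centroids - vector, axis=1)
    return int(np.argmin(dists))
```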

[Figure: centroid vectors formed for each group using the K-Means algorithm.]

MODELING Continued…

SYSTEM EVALUATION
NUMBER OF TRIALS                  : 50
NUMBER OF CORRECT IDENTIFICATIONS : 33
NUMBER OF WRONG IDENTIFICATIONS   : 17

EFFICIENCY: 33/50 = 66%

SOCIAL IMPACT

• An attendance system can be maintained with a prerecorded database of the students.
• It can be used for military purposes to identify the exact speaker.
• It can also be used in security applications in homes and banks.

CONCLUSION
• It is expected that the proposed method will work successfully in recognizing the speaker even in low-SNR environments corrupted by Gaussian white noise.

REFERENCES
[1] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, Vol. 3, 1995, pp. 72-83.
[2] C. M. Alamo, F. J. C. Gil, C. T. Munilla, and L. H. Gomez, "Discriminative training of GMM for speaker identification," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1996, pp. 89-92.
[3] C. S. Burrus, R. A. Gopinath, and H. Guo, Introduction to Wavelets and Wavelet Transforms, Prentice Hall, New Jersey, 1997.
[4] I. Daubechies, "Orthonormal bases of compactly supported wavelets," Communications on Pure and Applied Mathematics, Vol. 41, 1988, pp. 909-996.
[5] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, New York, 1993.
[6] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-29, 1981, pp. 254-272.

[7] H. C. Wang, "MAT - A project to collect Mandarin speech data through telephone networks in Taiwan," Computational Linguistics and Chinese Language Processing, Vol. 2, 1997, pp. 73-90.
[8] S. G. Mallat, "A theory for multiresolution signal decomposition: the wavelet representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, 1989, pp. 674-693.
[9] C. Miyajima, Y. Hattori, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Text-independent speaker identification using Gaussian mixture models based on multi-space probability distribution," IEEE Transactions on Information and Systems, Vol. E84-B, 2001, pp. 847-855.
[10] B. L. Pellom and J. H. L. Hansen, "An effective scoring algorithm for Gaussian mixture model based speaker identification," IEEE Signal Processing Letters, Vol. 5, 1998, pp. 281-284.
[11] M. J. Carey, E. S. Parris, et al., "Robust prosodic features for speaker identification," in Proceedings of ICSLP '96, 1996.
[12] T. Wark and S. Sridharan, "Adaptive fusion of speech and lip information for robust speaker identification," Digital Signal Processing, Vol. 11, 2001, pp. 169-186.
[13] H. Gish and M. Schmidt, "Text-independent speaker identification," IEEE Signal Processing Magazine, Vol. 11, pp. 18-32.
[14] D. J. Mashao and M. Skosan, "Combining classifier decisions for robust speaker identification," Pattern Recognition, Vol. 39, Issue 1, January 2006, pp. 147-155.
[15] S. R. Cloude, "An entropy based classification scheme for land applications of polarimetric SAR," IEEE Transactions on Geoscience and Remote Sensing, 2002, pp. 1100-1102.
[16] Young Han Lee and Hong Kook Kim, "Entropy coding of compressed feature parameters for distributed speech recognition," Speech Communication, Vol. 52, Issue 5, May 2010, pp. 405-412.
[17] D. S. Kim and S. Y. Lee, "Auditory processing of speech signals for robust speech recognition in real-world noisy environments," IEEE Transactions on Speech and Audio Processing, 2002, pp. 55-69.
[18] E. Wong, "Comparison of linear prediction cepstrum coefficients and mel-frequency cepstrum coefficients for language identification," IEEE Multimedia, Video and Speech Processing, 2002, pp. 95-98.
[19] K. Y. Park, "Narrowband to wideband conversion of speech using GMM based transformation," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002, pp. 1843-1846, Vol. 3.
[20] I. W. Selesnick, "The double-density dual-tree DWT," IEEE Transactions on Signal Processing, 2004, pp. 1304-1314.

Thank You!!!
