Bachelor of Technology
in
Electronics and Communication Engineering
by
Under Guidance of
Mr. Sandeep Saini
April 2017
Copyright
© The LNMIIT 2017
All Rights Reserved
The LNM Institute of Information Technology
Jaipur, India
CERTIFICATE
This is to certify that the project entitled Automatic Speaker Recognition, submitted by Sangeet Sagar
(15uec053) in partial fulfillment of the requirements of the degree of Bachelor of Technology (B. Tech), is a
bona fide record of work carried out by him at the Department of Electronics and Communication Engineering,
The LNM Institute of Information Technology, Jaipur (Rajasthan), India, during the academic
session 2016-2017 under my supervision and guidance, and the same has not been submitted elsewhere
for the award of any other degree. In my/our opinion, this report is of the standard required for the award of the
degree of Bachelor of Technology.
Acknowledgements

In today's competitive world, it is those with the will to come forward who succeed. A project
like this acts as a bridge between theoretical and practical learning, and it is with this aim that I
joined it. First of all, I would like to express my sincere thanks to the supreme power, the Almighty
God, who has always given me inner strength and constant determination; without His grace this
project could not have become a reality. Next to Him are my parents, to whom I am greatly indebted
for bringing me up with love and encouragement to this stage. I take this opportunity to sincerely
thank Mr. Sandeep Saini. I am also indebted to the entire Electronics and Communication discipline
for their supportive attitude and friendly behavior.

Last but not least, I am thankful to all my teachers and friends who have always been helping
and encouraging me throughout the year. Words cannot fully express my thanks; my heart remains
full of the favors received from every person.
Contents
Chapter Page
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
4 Speech Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1 Identification and Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1.1 Speaker identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1.2 Speaker verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.2 Text-dependent vs Text-independent . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2.1 Text-dependent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2.2 Text-independent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
7 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7.1 Database Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 1
Introduction
Today I dialed my father from an unknown number, but as soon as he heard my voice he quickly
recognized that it was me. How? Does he have a speaker recognition system installed in himself? Well,
maybe; after all, he has been hearing my voice since my childhood.

Speaker recognition helps us recognize a user based on their voice, since every voice has unique
characteristics that can be used to recognize and identify the speaker. This is exactly what we have done
in our project, Automatic Speaker Recognition and Verification, and what we present in this report.
Recognizing a speaker and distinguishing that speaker from others are two different tasks. This is exactly
what my father did after I called him: he first recognized my voice and then verified that it was me and
not my elder brother. Just as everyone has a unique fingerprint, everyone has a unique voice, and this
uniqueness helped my father verify me from my call.

In this report we demonstrate how we successfully implemented a speaker recognition system using
voice samples collected from eight different speakers, with 50 utterances from each speaker. For this
project we used MATLAB as our programming platform and WaveSurfer (a Linux-based tool) for
plotting different aspects of the wave signal, such as the spectrogram, LPC plot, and energy plot.
Chapter 2
When we hear a speech sample, what different inferences can we make? We can tell what is being
said and what language is being spoken; we can determine who the speaker is and what his/her gender
is; we can also roughly estimate his/her age and emotions. This seems interesting, but what's more
interesting is that we can even tell his/her approximate height.

Now comes the part specifically stressed in this report, i.e., speaker recognition. Automatic
Speaker Recognition (ASpR) is the process of automatically determining who is speaking based on
the information contained in the speech signal. This method uses the voice of the speaker to authenticate
his/her identity and provide him/her full access to the system.
Speech is a very complex signal, as it contains a lot of information about the speaker. This
information is carried at several levels, such as the linguistic, acoustic, and vocal-tract levels, and a
change at one level can bring about many variations in the others. The information contained at these
different levels helps us discriminate between speakers.
Chapter 3
Speech recognition is basically the ability of a system to recognize what is being spoken, irrespective
of who is speaking. It has many applications, such as directing a machine to perform limited tasks like
"turn right", "rotate", or "lift", or directing a coffee machine to make coffee either hot or cold. Here it
does not matter who the speaker is; the system must function as directed. Speech recognition systems
are speaker independent and hence have a limited vocabulary, and they can be used efficiently when a
limited vocabulary serves the purpose. Speaker-dependent systems permit a greater vocabulary size,
but at the cost of training the system for each user.
On the contrary, speaker recognition focuses on both the speaker and the speech. Take, for example,
a security system where a person speaks his unique password: access is denied if the person speaks the
password of another person. A new user must first be enrolled, and this enrollment is the primary
working of the system. Enrollment consists of guiding the speaker through a string of numeric or verbal
prompts. After the verbal prompts are successfully recorded, the system generates a model of the user's
vocal patterns; every individual has a unique model. When voice recognition is complemented with other
means of identification, such as a username and password or a physical key or combination, voice
biometrics turns out to be a very reliable way to judge identity.
Chapter 4
Speech Modalities
Figure 4.1 Block Diagram for Speaker Identification
Another important difference between speaker recognition systems is based on the text uttered by
the speaker during the identification process. This is discussed in detail below:
4.2.1 Text-dependent
In this case the recognition process is based purely on the text spoken by the user. The text
uttered by the user must be exactly the same as the text used during training. It is generally a fixed
text of which the user has prior knowledge, such as a PIN or a password.
Figure 4.3 Example of Speaker Identification and Speaker Verification
4.2.2 Text-independent
This modality does not enforce any restrictions on the linguistic content of the speech samples
involved in the verification process. Here the user has no prior knowledge of what is to be spoken and
is free to say anything, because the system is trained on continuous speech from the user, such as a
long paragraph or 5-6 minutes of speech.
Chapter 5
5.1 Pre-Processing
Pre-processing of speech signals is considered a vital phase in the development of an automatic
speaker recognition (ASpR) or speech recognition (ASR) system. The pre-processing proceeds as
described below.
Figure 5.3 Noise Removal of Speaker 9
The MFCC processor is explained in the block diagram given below. A sampling rate greater
than 1000 Hz is used to record the speech input; using a proper sampling frequency prevents
aliasing. The MFCCs carry information about the vocal tract and thus capture the properties of the
speaker. The following stepwise explanation depicts the exact process involved in the MFCC algorithm:
Step 1: Pre-emphasis

In the pre-emphasis step the high-frequency components are emphasized, thereby increasing the
energy of the speech signal at high frequencies. Mathematically, the filter is

s(n) = x(n) − a · x(n − 1), with a typically around 0.95,

where s(n) denotes the output sample, x(n) is the present sample, and x(n − 1) is the past sample.

Step 2: Windowing

The speech signal is broken down into numerous frames of similar size, each in the range
20-40 ms, and a Hamming window is applied to each frame, given by the equation

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1,

where N is the frame length.
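The pre-emphasis and windowing steps above can be sketched in a few lines of NumPy (illustrative Python rather than the project's MATLAB code; the pre-emphasis coefficient a = 0.95 and the 25 ms frame / 10 ms hop sizes are typical textbook values, not figures taken from this report):

```python
import numpy as np

def pre_emphasize(x, a=0.95):
    """s(n) = x(n) - a*x(n-1): boost high frequencies before analysis."""
    return np.append(x[0], x[1:] - a * x[:-1])

def frame_and_window(x, frame_len, hop_len):
    """Split the signal into overlapping frames and apply a Hamming window."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = x[idx]
    # Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    return frames * window

# Example: 1 s of a synthetic 200 Hz tone sampled at 8 kHz,
# 25 ms frames (200 samples) with a 10 ms hop (80 samples).
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
frames = frame_and_window(pre_emphasize(x), frame_len=200, hop_len=80)
print(frames.shape)  # (98, 200)
```

Each row of `frames` is one windowed analysis frame, ready for the FFT stage described below.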
Figure 5.4 Pitch Estimation using Auto-correlation Method
Step 3: Framing

The input speech signal is partitioned into frames with a duration less than the window duration.

Step 4: Fast Fourier Transform

This step converts the frames derived in Step 3 into the frequency domain, where we can draw the
necessary conclusions by analyzing the signal:

S(ω) = FFT(x(n))
Figure 5.6 MFCC: computation of the cepstrum follows the above scheme
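As a rough illustration of the scheme in Figure 5.6, the remaining MFCC stages (mel filterbank, log compression, and DCT to obtain the cepstrum) can be sketched in NumPy; the filter count (26) and coefficient count (13) are common choices for illustration, not values stated in this report:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to fs/2."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_from_frame(frame, fs, n_filters=26, n_ceps=13):
    """Power spectrum -> mel filterbank -> log -> DCT-II (cepstrum)."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n_fft
    mel_energies = np.log(mel_filterbank(n_filters, n_fft, fs) @ power + 1e-10)
    # DCT-II basis, keeping only the first n_ceps cepstral coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return basis @ mel_energies

# One windowed frame of a synthetic 440 Hz tone at 16 kHz.
frame = np.hamming(512) * np.sin(2 * np.pi * 440 * np.arange(512) / 16000)
coeffs = mfcc_from_frame(frame, fs=16000)
print(coeffs.shape)  # (13,)
```

Stacking such coefficient vectors over all frames and utterances gives the MFCC feature matrix used for classification later in the report.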
The unknowns a_k, k = 1, 2, …, p, are called the LPC coefficients, and they can be solved for by the
least-squares method.
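The least-squares solution for the LPC coefficients is commonly computed with the Levinson-Durbin recursion on the signal's autocorrelation sequence; a minimal sketch under that standard formulation (not the project's MATLAB code):

```python
import numpy as np

def lpc(x, p):
    """Estimate LPC coefficients a_1..a_p by the autocorrelation
    (least-squares) method via the Levinson-Durbin recursion."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]  # update previous coefficients
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # shrink the prediction error
    return a, err

# Recover the coefficients of a known 2nd-order all-pole process:
# x[n] = 0.5*x[n-1] - 0.3*x[n-2] + w[n], so A(z) = 1 - 0.5 z^-1 + 0.3 z^-2.
rng = np.random.default_rng(0)
x = np.zeros(20000)
w = rng.normal(0, 1, 20000)
for n in range(2, 20000):
    x[n] = 0.5 * x[n - 1] - 0.3 * x[n - 2] + w[n]
a, _ = lpc(x, 2)
print(np.round(a, 2))  # approximately [1.0, -0.5, 0.3]
```

In practice the first few LPC coefficients per frame are what get appended to the MFCC features, as described in the Experimental Setup chapter.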
Figure 5.7 Spectrum of the speech signal "Music" spoken by Speaker 9
Chapter 6
6.1 Artificial Neural Network (ANN)

ANNs are crude electronic models based on the neural structure of the brain; the human brain
learns from experience, and ANNs are computing systems whose architecture is modelled after it.
The classifier used in this speaker recognition system is the Back Propagation Neural Network (BPNN).
The backpropagation architecture takes as input nodes the features based on the MFCC coefficients,
and also the combination of the MFCC and LPC features. Typically, neural networks are adjusted, or
trained, so that a particular input leads to a specific target output. In the illustration below, the network
is adjusted, based on a comparison of the output and the target, until the network output matches the target.
There are two ways to carry out the classification in MATLAB: we can either start by typing
nnstart in the command window or use the command-line functions. The target matrix is prepared
according to the number of speakers and the number of utterances for each speaker.
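For readers without the MATLAB toolbox, the core training loop of a backpropagation network can be sketched in NumPy. The data here is synthetic: 25-dimensional clusters standing in for the real MFCC+LPC feature vectors, with 4 speakers and 50 utterances each; the hidden-layer size, learning rate, and iteration count are illustrative choices, not values from the report:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 25-dim "MFCC+LPC" vectors,
# each speaker clustered around its own random mean.
n_speakers, n_feats, n_utt = 4, 25, 50
means = rng.normal(0, 3, (n_speakers, n_feats))
X = np.vstack([m + rng.normal(0, 1, (n_utt, n_feats)) for m in means])
y = np.repeat(np.arange(n_speakers), n_utt)
T = np.eye(n_speakers)[y]                      # one-hot target matrix

# One hidden layer of sigmoid units, softmax output, batch gradient descent.
W1 = rng.normal(0, 0.1, (n_feats, 16)); b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, n_speakers)); b2 = np.zeros(n_speakers)
lr = 0.3
for _ in range(1000):
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))   # forward pass (hidden layer)
    Z = H @ W2 + b2
    P = np.exp(Z - Z.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)          # softmax posteriors
    dZ = (P - T) / len(X)                      # backpropagated cross-entropy error
    dH = dZ @ W2.T * H * (1.0 - H)
    W2 -= lr * H.T @ dZ; b2 -= lr * dZ.sum(0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)

train_acc = (P.argmax(axis=1) == y).mean()
print(train_acc)
```

Because the synthetic clusters are well separated, the network reaches near-perfect training accuracy, mirroring the role the target matrix plays in the MATLAB workflow.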
6.2 Gaussian Mixture Model (GMM)
Gaussian mixture model is used for classification. A Gaussian mixture density is a weighted sum of
M component densities, given by the equation

p(x | λ) = Σ_{i=1}^{M} p_i b_i(x)

where x is a random vector, i = 1, 2, …, M indexes the mixture components, and the p_i are the mixture
weights with Σ_{i=1}^{M} p_i = 1. The component densities b_i(x) are

b_i(x) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp[ −(1/2) (x − μ_i)′ Σ_i^{−1} (x − μ_i) ]

where D is the dimension of x, and μ_i and Σ_i are the mean vector and covariance matrix, respectively.
A GMM can be seen as a hybrid of a unimodal Gaussian model and a Vector Quantization (VQ)
model. It does not use a hard distance measure as VQ does but instead uses probabilities, which makes
it capable of approximating arbitrarily shaped densities. And as discussed before in the MFCC section,
a GMM may model some underlying acoustic classes by assigning each class to a Gaussian mixture
component.
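The mixture density defined above can be evaluated directly; a small NumPy sketch with made-up two-component parameters (illustrative only, not the model fitted in this project):

```python
import numpy as np

def gmm_density(x, weights, means, covs):
    """p(x|lambda) = sum_i p_i * b_i(x), with full-covariance Gaussians b_i."""
    D = len(x)
    total = 0.0
    for p_i, mu, sigma in zip(weights, means, covs):
        diff = x - mu
        norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(sigma))
        b_i = np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm
        total += p_i * b_i
    return total

# Two-component mixture in 2-D with equal weights (hypothetical numbers).
weights = np.array([0.5, 0.5])
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(gmm_density(np.zeros(2), weights, means, covs))  # ≈ 0.0796
```

At the origin the first component dominates, so the mixture value is close to half the peak of a standard 2-D Gaussian; classification then amounts to picking the speaker model λ with the largest p(x | λ) over the test utterance.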
Chapter 7
Experimental Setup
Figure 7.1 Speech Signal Waveform and Plot of Periodogram Power Spectral Density
noise. To increase the accuracy of the feature extraction process, these LPC coefficients are vertically
concatenated (using the vertcat function in MATLAB) with the MFCC matrix to obtain 25 coefficients
for each sample.
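In NumPy the same vertical concatenation looks like this; the 13/12 split between MFCC and LPC coefficients is an assumed breakdown of the 25 coefficients, since the report only states the total:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 8 * 50                       # 8 speakers x 50 utterances
mfcc = rng.random((13, n_samples))       # assumed 13 MFCC coefficients per sample
lpc = rng.random((12, n_samples))        # assumed 12 LPC coefficients per sample
features = np.vstack([mfcc, lpc])        # NumPy analogue of MATLAB's vertcat
print(features.shape)  # (25, 400)
```

Each column of `features` is the 25-coefficient vector for one utterance, which is what feeds the classifiers in the next section.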
Figure 7.2 Mel-frequency cepstral coefficients (MFCC) for the first 10 utterances of Speaker 9
Figure 7.3 MFCC Plot of 8 Speakers
7.3 Classification
For classification we used Artificial Neural Networks (ANN) and Gaussian Mixture Models
(GMM). We implemented the neural networks using MATLAB's pattern recognition app. Here we
present the input data to the network as our MFCC matrix, which contains the coefficients of all the
samples from the 8 speakers, together with a target matrix defining the desired network output. The
app automatically selects 70% of the samples for training, 15% for validation, and the remaining 15%
for testing, and we then train the network.
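The 70/15/15 split the app performs can be reproduced explicitly; a sketch in Python using the sample count implied by the report (8 speakers × 50 utterances):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 8 * 50                   # 8 speakers x 50 utterances
idx = rng.permutation(n_samples)     # shuffle sample indices
n_train = int(0.70 * n_samples)      # 280 samples for training
n_val = int(0.15 * n_samples)        # 60 for validation
train, val, test = np.split(idx, [n_train, n_train + n_val])
print(len(train), len(val), len(test))  # 280 60 60
```

The three index sets are disjoint, so no utterance used for training leaks into validation or testing.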
Figure 7.4 Confusion matrices obtained from (a) MFCC features only (99.8%) and (b) MFCC and
LPC features combined (100%)

Figure 7.5 Performance plots for (a) MFCC and (b) MFCC and LPC combined. We observe that
the training and testing curves move in the same direction in a similar manner.
Below is our algorithm for the GMM classifier:

% Recognition Part
data = dataO';
[O, T] = size(data);
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
[mu0, Sigma0] = mixgauss_init(Q*M, data, cov_type);
mu0 = reshape(mu0, [O Q M]);
Sigma0 = reshape(Sigma0, [O O Q M]);
mixmat0 = mk_stochastic(rand(Q,M));
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', p);
7.4 Results
Using ANN
Using GMM
Figure 7.6 On testing with a voice sample of Speaker 8, the plot above shows that the highest
recognition probability is indeed for Speaker 8.
Chapter 8
The results derived using the MFCC technique and the LPC algorithm are encouraging. The MFCCs
for each speaker were computed and then mapped into the target matrix. We observed a 99.8%
recognition accuracy using neural network classification, while combining MFCC and LPC features
gave a recognition accuracy of 100%. This high recognition rate is due to the limited database, i.e.,
8 speakers only, and to the fact that this project performed text-dependent verification.

We then tried another method for speaker recognition, the GMM algorithm. GMM models are
motivated by the facts that the vocal-tract information of a speaker follows a Gaussian distribution and
that a Gaussian model approximates the feature space as a smooth surface. The accuracy obtained
using GMM was excellent, which clearly indicates its high efficiency. Unlike neural networks, it does
not build a network; it estimates the mean, prior, covariance, etc. of each speech sample.
There is considerable future scope for this project. Our project mainly comprises text-dependent
verification, and this is a major limitation: if the user forgets his password, he will be unable to access
the system even though he is authorized to do so. In that case, the system should be trained in such a
way that any word spoken by the user can be verified by the system. We can use continuous speech
instead of isolated words for training our system, which should yield higher accuracy. In continuous
speech, sentences are broken into words and the words are further broken down to the phoneme level.
There may be an unlimited number of words and sentences, but there is only a limited number of
phonemes, and we can utilize this fact to help the system recognize any word spoken by the user.
Bibliography
[1] J. P. Campbell, Jr., "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9,
pp. 1437-1462, 1997.
[2] Yiyin Zhou, "Speaker Identification," E6820 Spring '08 Final Project Report.
[3] Ashish Kumar Panda (107ec016) and Amit Kumar Sahoo (107ec014), "Study of Speaker
Recognition Systems."
[4] Daniel Garcia-Romero, "Robust Speaker Recognition Based on Latent Variable Models," 2012.
[6] S. Nakagawa, L. Wang, and S. Ohtsuka, "Speaker identification and verification by combining
MFCC and phase information," IEEE Transactions on Audio, Speech, and Language Processing,
vol. 20, no. 4, May 2012.
[7] A. E. Rosenberg, "Automatic speaker verification: A review," Proc. IEEE, vol. 64, no. 4,
pp. 475-487, Apr. 1976.
[8] R. M. Hegde, H. A. Murthy, and G. V. R. Rao, "Application of the modified group delay function
to speaker identification and discrimination," in Proc. ICASSP, 2004, pp. 517-520.
[9] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian
mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.