
Automatic Speaker Recognition

Mini Project report submitted in partial fulfillment


of the requirements for the degree of

Bachelor of Technology
in
Electronics and Communication Engineering

by

Sangeet Sagar - 15uec053

Under Guidance of
Mr. Sandeep Saini

Department of Electronics and Communication Engineering


The LNM Institute of Information Technology, Jaipur

April 2017
Copyright © The LNMIIT 2017
All Rights Reserved
The LNM Institute of Information Technology
Jaipur, India

CERTIFICATE

This is to certify that the project entitled Automatic Speaker Recognition, submitted by Sangeet Sagar
(15uec053) in partial fulfillment of the requirements of the degree of Bachelor of Technology (B.Tech), is a
bonafide record of work carried out by him at the Department of Electronics and Communication Engineering,
The LNM Institute of Information Technology, Jaipur (Rajasthan), India, during the academic
session 2016-2017 under my supervision and guidance, and the same has not been submitted elsewhere
for the award of any other degree. In my opinion, this report is of the standard required for the award of the
degree of Bachelor of Technology.

Date Adviser: Mr. Sandeep Saini


Dedicated to My Family and Friends
Acknowledgments

In today's competitive world there is a race for existence in which those who have the will to come
forward succeed. A project like this is a bridge between theoretical and practical learning, and it is
with this willingness that I joined it. First of all, I would like to express my sincere thanks to the
supreme power, Almighty God, who has always given me inner spirit and constant determination;
without His grace this project could not have become a reality. Next to Him are my parents, to whom I
am greatly indebted for bringing me up with love and encouragement to this stage. I feel obliged to take
this opportunity to sincerely thank Mr. Sandeep Saini. I am also indebted to the entire Electronics and
Communication discipline for their positive attitude and friendly behavior.
Last but not least, I am thankful to all my teachers and friends who have always helped and
encouraged me throughout the year. I am speechless when it comes to expressing my thanks, but my
heart is still full of the favors received from every person.

Contents

Chapter Page

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 What is speaker recognition? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

3 How is Speaker Recognition different from Speech Recognition? . . . . . . . . . . . . . . . 3

4 Speech Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1 Identification and Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1.1 Speaker identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.1.2 Speaker verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
4.2 Text-dependent vs Text-independent . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2.1 Text-dependent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
4.2.2 Text-independent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

5 Speech Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7


5.1 Pre-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
5.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5.2.1 Mel-frequency Cepstrum coefficients processor . . . . . . . . . . . . . . . . . 8
5.2.2 Linear Prediction Cepstral Coefficients (LPCC) . . . . . . . . . . . . . . . . . 10

6 Speaker Modelling / Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


6.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
6.2 Gaussian Mixture Model (GMM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

7 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7.1 Database Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

8 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Chapter 1

Introduction

Today I dialed my father from an unknown number, but as soon as he heard my voice he quickly
recognized it was me. How? Does he have a speaker recognition system installed in him? Well,
maybe; after all, he has been hearing my voice since my childhood.
Speaker recognition lets us recognize a user based on his or her voice, since every voice has unique
characteristics that can be used to identify the speaker. This is exactly what we have done in our
project, Automatic Speaker Recognition and Verification, and what we present in this report.
Recognizing a speaker and distinguishing that speaker from others are two different tasks. This is
exactly what my father did after I called him: he first recognized my voice and then verified that it was
me and not my elder brother. Just as everyone has a unique fingerprint, everyone has a unique voice,
and this uniqueness helped my father verify me from my call.
In this report we demonstrate how we implemented a speaker recognition system on the basis of
voice samples collected from eight different speakers, with 50 utterances from each speaker. For this
project we used MATLAB as our programming platform and WaveSurfer (a Linux-based tool) for
plotting different aspects of the speech signal, such as the spectrogram, LPC plot and energy plot.

Chapter 2

What is speaker recognition?

When we hear a speech sample, what inferences can we make? We can tell what is being said and
what language is being spoken; we can determine who the speaker is and the speaker's gender; we can
also roughly estimate the speaker's age and emotions. This seems interesting, but what is more
interesting is that we can even estimate the speaker's approximate height.
Now comes the part specifically stressed in this report, i.e. speaker recognition. Automatic
Speaker Recognition (ASpR) is the process of automatically determining who is speaking based on
the information contained in the speech signal. This method uses the voice of the speaker to
authenticate his or her identity and provide full access to the system.
Speech is a very complex signal, as it contains a lot of information about the speaker. It carries
information at several levels of complexity: linguistic, acoustic and vocal tract. A small change at any
one of these levels can produce large variations in the signal, and the information carried at these
different levels helps us discriminate between speakers.

Chapter 3

How is Speaker Recognition different from Speech Recognition?

Speech recognition is the ability of a system to recognize what is being spoken, irrespective of who
the speaker is. It has many applications, such as directing a machine to perform limited tasks like
"turn right", "rotate" or "lift", or instructing a coffee machine to make coffee either hot or cold. Here
it does not matter who the speaker is; the system must function as directed. Speech recognition
systems are typically speaker independent and hence have a limited vocabulary, and they can be used
efficiently when a limited vocabulary serves the purpose. Speaker-dependent systems permit a greater
vocabulary size, but at the cost of training the system for each user.
On the contrary, speaker recognition focuses on the speaker as well as the speech. Take the
example of a security system in which a person speaks his unique password: access is denied if the
person speaks the password of another person. A new user must first be enrolled, and this enrollment
is central to the working of the system. Enrollment consists of guiding the speaker through a string of
numeric or verbal prompts; after the prompts are successfully recorded, the system generates a model
of the user's vocal patterns, and every individual has a unique model. When voice biometrics is
complemented with other means of identification, such as a username and password or a physical
key, it turns out to be a very reliable way to establish identity.

Chapter 4

Speech Modalities

4.1 Identification and Verification


This is the most important category of classification. Automatic speaker identification
and verification are often considered the most natural and economical methods for preventing
unauthorized access to physical locations or computer systems. Let us discuss them in detail: -

4.1.1 Speaker identification


Speaker identification is the process of determining which speaker, from a set of known (enrolled) speakers,
produced a given utterance: the voice sample is compared against every enrolled model and the best match is selected.

4.1.2 Speaker verification


Speaker verification is the process of accepting or rejecting an identity claim made by a speaker:
the voice sample is compared only against the model of the claimed identity. The figures given below
give an insight into how the above-mentioned processes of identification and verification differ from
each other.
Figure 4.3 shows a concrete example of speaker verification and speaker identification. In the
first case Bob claims to be John and speaks the password, but is rejected by the system; here the
identity of John is being verified. In the second case a third person claims to be in the system but is
denied access, since he is not registered in the system.

Figure 4.1 Block Diagram for Speaker Identification

Figure 4.2 Block Diagram for Speaker Verification

4.2 Text-dependent vs Text-independent

Another important distinction between speaker recognition systems is based on the text uttered by
the speaker during the identification process. The same is discussed in detail below: -

4.2.1 Text-dependent

In this case the recognition process is based purely on the text spoken by the user. The text
uttered by the user must be exactly the same as the text used during training. It is generally a fixed
phrase that the user knows in advance, such as a PIN or a password.

Figure 4.3 Example of Speaker Identification and Speaker Verification

4.2.2 Text-independent
This modality places no restrictions on the linguistic content of the speech samples involved in the
verification process. The user has no prior knowledge of what is to be spoken and is free to say
anything, because the system is trained on continuous speech of the user, such as a long paragraph or
5-6 minutes of recorded speech.

Chapter 5

Speech Feature Extraction

5.1 Pre-Processing

Pre-processing of speech signals is considered a vital phase in the development of any smart speaker
recognition (ASpR) or speech recognition (ASR) system. The pre-processing stage proceeds as
described below.

Figure 5.1 General Steps of the Pre-Processing stage


Figure 5.2 Energy Plot of the Speech Signal

Figure 5.3 Noise Removal of Speaker 9

5.2 Feature Extraction

5.2.1 Mel-frequency Cepstrum coefficients processor

The MFCC processor is explained in the block diagram given below. The speech input is typically
recorded at a sampling rate above 10000 Hz; the use of a proper sampling frequency prevents the
menace of aliasing. The MFCCs carry information about the vocal tract and hence capture the
properties of the speaker. The following stepwise explanation depicts the exact process involved in
the MFCC algorithm: -

Step 1: Pre-emphasis

In the pre-emphasis step the high-frequency components are emphasized, thereby increasing the
energy of the speech signal at high frequencies. The exact mathematical operation is:

S(n) = X(n) − 0.95 X(n − 1)

where S(n) denotes the output sample, X(n) the present sample and X(n − 1) the previous sample.

Step 2: Framing and Overlapping

Here the speech signal is broken down into numerous frames of equal size, each 20-40 ms long.
A Hamming window is then applied to each frame, given by the equation below:

S(n) = X(n) · W(n),   where W(n) = 0.54 − 0.46 cos(2πn / (N − 1)),   0 ≤ n ≤ N − 1

where S(n) denotes the windowed output sample, X(n) the input sample and N the number of samples
per frame.
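Steps 1 and 2 can be sketched in Python with NumPy (the project itself used MATLAB; the frame length and hop size in the usage note are typical values, not taken from the report):

```python
import numpy as np

def preemphasize(x, alpha=0.95):
    # S(n) = X(n) - alpha * X(n - 1); the first sample is passed through
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len, hop):
    # split into overlapping frames and apply a Hamming window,
    # W(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)
```

At a 16 kHz sampling rate, 25 ms frames with a 10 ms hop would correspond to frame_len = 400 and hop = 160 samples.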

Figure 5.4 Pitch Estimation using Auto-correlation Method

Figure 5.5 Block diagram of the MFCC processor

Step 3: Frame Overlap

The frames derived above overlap: the frame shift, i.e. the duration by which the analysis window
advances, is smaller than the window duration, so that consecutive frames share samples.

Step 4: Fast Fourier Transform

This step converts the frames derived above into the frequency domain, where we can make all the
necessary conclusions by analyzing the signal's spectrum:

S(ω) = FFT(X(n))
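This step can be sketched as follows (the FFT size of 512 is an assumed, typical value):

```python
import numpy as np

def power_spectrum(frames, nfft=512):
    # one-sided FFT of each windowed frame (rows are frames),
    # followed by a periodogram-style power estimate
    spec = np.fft.rfft(frames, n=nfft)
    return np.abs(spec) ** 2 / nfft
```

Each row of the result holds nfft/2 + 1 power values, one per frequency bin up to half the sampling rate.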

Figure 5.6 MFCC- Computation of the Cepstrum is followed by the above scheme

Step 5: Mel Wrapping

In this step the magnitude spectrum of each frame is mapped onto the mel scale, which approximates
the frequency resolution of the human ear, using a bank of overlapping triangular filters (Figure 5.8).
The mel scale is related to frequency in Hz by mel(f) = 2595 log10(1 + f/700). The log energies of the
filter outputs are then converted into the cepstral coefficients (Figure 5.6).
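The mel wrapping step can be sketched as a bank of triangular filters spaced uniformly on the mel scale (the filter count, FFT size and sampling rate below are typical assumed values, not taken from the report):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, nfft=512, fs=16000):
    # filter centre frequencies are equally spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising slope of the triangle
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope of the triangle
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb
```

Multiplying a frame's power spectrum by this matrix yields one energy value per mel filter, as in Figure 5.8.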

5.2.2 Linear Prediction Cepstral Coefficients (LPCC)


Linear Predictive Coding (LPC) analysis states that a speech sample can be approximated as a linear
combination of past speech samples. LPC is based on the source-filter model of speech production:
the current sample s(n) is predicted as

ŝ(n) = a1·s(n − 1) + a2·s(n − 2) + … + ap·s(n − p)

The unknowns a_k, k = 1, 2, …, p are called the LPC coefficients and can be solved for by the
least-squares method.
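Solving for the coefficients by the autocorrelation (Levinson-Durbin) method can be sketched in Python; the project itself used MATLAB's built-in lpc command. Note the sign convention: like MATLAB's lpc, this returns the prediction-error filter [1, a_1, …, a_p], so the predictor coefficients in the equation above are the negated a_k:

```python
import numpy as np

def lpc(x, p):
    # autocorrelation method with Levinson-Durbin recursion
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0                 # first coefficient is always 1
    err = r[0]                 # prediction-error energy
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err         # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a
```

For a pure sinusoid of frequency ω, the order-2 solution approaches [1, −2 cos ω, 1], whose zeros lie on the unit circle at ±ω, so the prediction residual is nearly zero.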

Figure 5.7 Spectrum of the speech signal "music" spoken by Speaker 9

Figure 5.8 An example of mel-spaced filterbank

Chapter 6

Speaker Modelling / Classification

6.1 Artificial Neural Networks

An artificial neural network (ANN) is a simplified computational model inspired by the neural
structure of the brain; just as the human brain learns from experience, an ANN learns from training
data, with its architecture modelled after the brain. The classifier used in this work is the Back-Propagation
Neural Network (BPNN). The back-propagation architecture takes as its input nodes feature vectors
built from the MFCC coefficients, or from the combination of MFCC and LPC features. Typically,
neural networks are adjusted, or trained, so that a particular input leads to a specific target output:
as in the illustration below, the network weights are updated based on a comparison of the network
output with the target, until the network output matches the target.
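This training loop can be sketched in NumPy. The sketch below is a toy illustration on XOR data rather than the project's MFCC features; the hidden-layer size, learning rate and iteration count are arbitrary choices for the example, not values from the report:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data standing in for the feature vectors: XOR, a problem a
# single-layer network cannot solve but a BPNN can
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # target matrix

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # input  -> hidden
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
losses = []
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)          # forward pass
    y = sigmoid(h @ W2 + b2)
    losses.append(np.mean((y - T) ** 2))
    dy = (y - T) * y * (1 - y)        # output-layer delta
    dh = (dy @ W2.T) * h * (1 - h)    # error backpropagated to hidden layer
    W2 -= lr * h.T @ dy; b2 -= lr * dy.sum(0)
    W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(0)
```

The squared error between output and target shrinks as the weights are repeatedly corrected, which is exactly the comparison-and-adjustment loop described above.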

Figure 6.1 Illustration of neural networks

There are two ways to carry out the classification in MATLAB: either start by typing
nnstart in the command window, or use the command-line functions. The target matrix must be
prepared according to the number of speakers and the number of utterances per speaker.

6.2 Gaussian Mixture Model (GMM)
Gaussian mixture model is used for classification. A Gaussian mixture density is a weighted sum of
M component densities given by the equation

p(x|λ) = Σ_{i=1}^{M} p_i b_i(x)

where x is a D-dimensional random vector, i = 1, 2, …, M indexes the mixture components, the p_i are
the mixture weights satisfying Σ_{i=1}^{M} p_i = 1, and the b_i(x) are the component densities

b_i(x) = ( 1 / ((2π)^(D/2) |Σ_i|^(1/2)) ) exp[ −(1/2) (x − μ_i)′ Σ_i^(−1) (x − μ_i) ]

where μ_i and Σ_i are the mean vector and covariance matrix of the i-th component, respectively.
A GMM can be seen as a hybrid of a unimodal Gaussian model and a Vector Quantization (VQ) model.
It does not use a hard distance measure as VQ does, but probabilities instead, which makes it capable
of approximating arbitrarily shaped densities. And as discussed before in the MFCC section, a GMM
may model the underlying acoustic classes by assigning each class its own Gaussian component.
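The mixture density above can be evaluated directly. The sketch below assumes diagonal covariance matrices for simplicity (the report does not specify the covariance type), so |Σ_i| reduces to a product of per-dimension variances:

```python
import numpy as np

def gmm_density(x, weights, means, variances):
    # p(x|lambda) = sum_i p_i * b_i(x), with diagonal covariances:
    # |Sigma_i| is the product of the variances, and the quadratic form
    # (x - mu_i)' Sigma_i^-1 (x - mu_i) is a weighted sum of squares
    d = len(x)
    total = 0.0
    for p_i, mu, var in zip(weights, means, variances):
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.prod(var))
        mahal = np.sum((x - mu) ** 2 / var)
        total += p_i * norm * np.exp(-0.5 * mahal)
    return total
```

For classification, each speaker gets one trained GMM, and a test utterance is assigned to the speaker whose model yields the highest total likelihood over its frames.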

Chapter 7

Experimental Setup

7.1 Database Collection


Speech samples from a total of eight speakers (comprising 5 male and 3 female speakers) were
collected for this project. All the voice samples were recorded in a noise-controlled environment
with only an air conditioner running. We recorded a total of 50 utterances from each speaker, making
a total of 400 samples, at a sampling frequency of 16000 Hz. The following table gives the text
uttered by each of the speakers: -

Speaker     Sex   Sample Recorded   Name

Speaker 1   M     welcome           Mr. Rahul
Speaker 2   F     mobile            Miss Vaidehi Sharma
Speaker 3   F     dream             Miss Sandhya Soni
Speaker 4   F     laptop            Miss Shipra Bhatia
Speaker 5   M     passion           Mr. Nandit
Speaker 6   M     information       Mr. Aditya Singh Sengar
Speaker 7   M     liberty           Mr. Vaibhav Agrawal
Speaker 9   M     music             Mr. Sangeet Sagar

7.2 Feature Extraction


We implemented the MFCC and LPC algorithms to determine the cepstral coefficients. For MFCC
we used MIRtoolbox, a MATLAB toolbox dedicated to the extraction of musical features from audio
files, including routines for statistical analysis, segmentation and clustering. We take only 13 MFCC
coefficients, as they contain the majority of the information in the message signal; taking more than
13 coefficients would be of little use and would also pick up undesirable information. For the LPC
coefficients we used the built-in command in MATLAB. We generally take only 11 or 12 coefficients
when calculating LPC, and the first coefficient is always 1; taking more than 11-12 coefficients
generally picks up undesirable information about the speech signal, such as background noise. To
increase the accuracy of the feature extraction process, these LPC coefficients are vertically
concatenated (using the vertcat function in MATLAB) with the MFCC matrix to obtain 25 coefficients
for each sample.

Figure 7.1 Speech Signal Waveform and Plot of Periodogram Power Spectral Density
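The vertical concatenation can be sketched as follows; the matrix shapes are assumptions based on the counts above (13 MFCC rows and 12 LPC rows, with one column per utterance):

```python
import numpy as np

# hypothetical feature matrices: one column per utterance of a speaker
mfcc_feats = np.zeros((13, 50))   # 13 MFCC coefficients x 50 utterances
lpc_feats = np.zeros((12, 50))    # 12 LPC  coefficients x 50 utterances

# MATLAB's vertcat corresponds to numpy's vstack: stack row-wise so that
# each utterance now carries 13 + 12 = 25 coefficients
combined = np.vstack([mfcc_feats, lpc_feats])
```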

Figure 7.2 Mel-frequency Cepstrum Coefficients (MFCC) for the first 10 utterances of Speaker 9

Figure 7.3 MFCC Plot of 8 Speakers

7.3 Classification

For classification purposes, we used Artificial Neural Networks (ANN) and the Gaussian Mixture
Model (GMM). We implemented the neural network using MATLAB's pattern recognition app. We
present the input data to the network as our MFCC matrix, which contains the coefficients of all the
samples from the 8 speakers, along with a target matrix defining the desired network output. The app
automatically selects 70% of the samples for training, 15% for validation and the remaining 15% for
testing, and we then train the network.
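The 70/15/15 split performed by the app can be sketched as follows (the index arrays and random seed are illustrative; the app handles this internally):

```python
import numpy as np

n_samples = 400                          # 8 speakers x 50 utterances
idx = np.random.default_rng(0).permutation(n_samples)
n_train = int(0.70 * n_samples)          # 280 samples for training
n_val = int(0.15 * n_samples)            # 60 samples for validation
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]         # remaining 60 samples for testing
```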

Figure 7.4 Confusion matrix obtained from (a) only MFCC features (99.8%) and (b) combined MFCC
and LPC features (100%)

Figure 7.5 Performance plot for (a) MFCC and (b) MFCC and LPC combined. We observe that
the training and testing curves move in the same direction in a similar manner

Below is our algorithm for the GMM classifier (the helper functions normalise, mk_stochastic,
mixgauss_init and mhmm_em come from Kevin Murphy's HMM Toolbox for MATLAB): -

% Recognition part
data = data0';                       % one feature vector per column
[O, T] = size(data);                 % O = feature dimension, T = no. of vectors
prior0 = normalise(rand(Q, 1));      % random initial state priors
transmat0 = mk_stochastic(rand(Q, Q));
[mu0, Sigma0] = mixgauss_init(Q*M, data, cov_type);
mu0 = reshape(mu0, [O Q M]);
Sigma0 = reshape(Sigma0, [O O Q M]);
mixmat0 = mk_stochastic(rand(Q, M));
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', p);

7.4 Results
Using ANN

Using GMM

Figure 7.6 On testing with a voice sample of Speaker 8, it can be seen in the plot above that the
highest recognition probability is obtained for Speaker 8, as expected.

Chapter 8

Conclusion and Future Work

The results derived using the MFCC technique and the LPC algorithm are appreciable. The MFCCs
for each speaker were computed and then mapped to the target matrix. We observed a recognition
accuracy of 99.8% with neural network classification using MFCC alone, while combining MFCC
and LPC gave a recognition accuracy of 100%. This high recognition rate is due to the limited
database, i.e. only 8 speakers, and to the fact that this project was performed as text-dependent
verification.
We then tried another method for speaker recognition, the GMM algorithm. GMM models are
motivated by the facts that the vocal-tract information of a speaker follows a Gaussian distribution
and that a Gaussian model approximates the feature space as a smooth surface. The accuracy obtained
using GMM was excellent, which clearly indicates its high efficiency. Unlike a neural network, it does
not build a network; instead it estimates the mean, prior and covariance of each speech sample.
There is considerable scope for future work in this project. Our project mainly comprises text-dependent
verification, which is a major limitation: if the user forgets his password, he will be unable to access
the system even though he is authorized to do so. The system should instead be trained in such a way
that any word spoken by the user can be verified. We can use continuous speech recognition instead
of isolated words for training our system, which should yield higher accuracy. In continuous speech
recognition, sentences are broken into words and the words are further broken down to the phoneme
level. There may be an unlimited number of words and sentences, but there are a limited number of
phonemes, and we can utilize this fact to help the system recognize any word spoken by the user.

Bibliography

[1] J. P. Campbell, Jr., "Speaker recognition: a tutorial," Proceedings of the IEEE, vol. 85, no. 9,
pp. 1437-1462, 1997.

[2] Yiyin Zhou, "Speaker Identification," E6820 Spring 08 Final Project Report.

[3] Ashish Kumar Panda (107ec016) and Amit Kumar Sahoo (107ec014), "Study of Speaker
Recognition Systems."

[4] Daniel Garcia-Romero, "Robust Speaker Recognition based on Latent Variable Models," 2012.

[5] D. Ellis and Lee, "Analysis of everyday sounds,"
http://www.ee.columbia.edu/~dpwe/e6820/lectures/L12-archives.pdf

[6] S. Nakagawa, L. Wang, and S. Ohtsuka, "Speaker Identification and Verification by Combining
MFCC and Phase Information," IEEE Transactions on Audio, Speech, and Language Processing,
vol. 20, no. 4, May 2012.

[7] A. E. Rosenberg, "Automatic speaker verification: A review," Proc. IEEE, vol. 64, no. 4,
pp. 475-487, Apr. 1976.

[8] R. M. Hegde, H. A. Murthy, and G. V. R. Rao, "Application of the modified group delay function
to speaker identification and discrimination," in Proc. ICASSP, 2004, pp. 517-520.

[9] D. A. Reynolds, T. F. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian
mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.
