Contents
ABSTRACT .......... 3
ACKNOWLEDGEMENT .......... 4
CHAPTER 1 .......... 5
1.0 INTRODUCTION .......... 5
1.1 BACKGROUND .......... 5
1.2 PROBLEM STATEMENT .......... 6
1.3 JUSTIFICATION OF THE PROJECT .......... 6
1.4 OBJECTIVES .......... 6
1.5 METHODOLOGY .......... 6
CHAPTER 2 .......... 7
2.0 VOICE .......... 7
2.0 INTRODUCTION .......... 7
2.1 WHAT IS VOICE .......... 7
2.2 VOICE PRODUCTION .......... 7
2.3 VOICE MECHANISM .......... 8
2.4 VOICE CHARACTERISTICS .......... 8
2.5 VOICE QUALITIES .......... 9
2.6 VOICE BIOMETRICS .......... 9
2.7 BIOMETRIC SYSTEM .......... 10
CHAPTER 3 .......... 11
3.0 INTRODUCTION .......... 11
3.1 OPEN SET VS CLOSED SET .......... 12
3.2 IDENTIFICATION VS VERIFICATION .......... 12
3.3 MODULES .......... 14
CHAPTER 4 .......... 15
4.0 INTRODUCTION .......... 15
4.1 PRE-PROCESSING .......... 16
4.2.1 Frame blocking .......... 16
4.2.2 Windowing .......... 17
4.2.3 Truncation .......... 18
4.2.4 Short Term Fast Fourier Transform .......... 19
4.2.5 Mel Frequency Warping .......... 20
4.2.6 Cepstrum .......... 22
4.2.7 Linear Predictive Analysis .......... 22
Voice recognition
SUPERVISOR: DR P MANYERE
ABSTRACT
Voice recognition is the process of validating a user's claimed identity using features extracted from his or her voice. It has many applications in forensic speaker recognition, authentication and other fields. Speaker recognition is divided into identification and verification. Identification is the process of finding out which registered speaker has given an utterance, whilst verification is the process of accepting or rejecting a speaker's identity claim. The process of speaker recognition consists of feature extraction and feature matching. Feature extraction is the process in which we extract a small amount of data from the voice signal that can be used to represent each speaker. Feature matching involves identification of the unknown speaker. My project consists of pre-processing signals (s.1wav to s.8wav), passing them through a window function, calculating the short-term Fast Fourier Transform (FFT), extracting features and matching them against the stored templates. The methods used for feature extraction are Cepstral Coefficient calculation and Mel Frequency Cepstral Coefficients (MFCC), and the ones used for feature matching are Gaussian Mixture Models (GMM), Dynamic Time Warping (DTW) and Vector Quantization via Linde-Buzo-Gray (VQ-LBG).
ACKNOWLEDGEMENT
The success of this project hinged mainly on the inspiration and advice of many others. I would like to express my deep gratitude to Dr. P. Manyere, who has been a guiding force through his supervision, valued feedback and encouragement for the duration of the project. Next I want to express my appreciation to the people who contributed, directly or indirectly, to the successful accomplishment of this project. Finally I would like to thank God for the strength he gave me throughout the year and the love he put in my wonderful family members, who have been supportive socially, spiritually and financially. I am grateful for this tremendous help.
CHAPTER 1
1.0 INTRODUCTION
The purpose of this project is to design and simulate a system in MATLAB that can identify an individual who is speaking by analysing the spectral content of their voice. This involves digital voice analysis and recognition. The discussion starts with the theoretical background and then presents the problem statement, justification, aims and objectives, methodology and finally the results.
1.1 BACKGROUND
Long ago people lived in caves, but with the advent of industrialisation many began to build their own homes. Many sectors then grew rapidly, with marked improvements in home automation, mainly in areas such as security, culture, leisure, comfort, energy savings, management and economic activities. As a result, speaker (voice) recognition has improved over the years. Speech is one of the oldest and most natural means of information exchange between human beings, and for years people have tried to develop machines that can understand and produce speech naturally.
Voice recognition is a biometric modality that uses an individual's voice for recognition purposes. It depends on features influenced by the physical structure of an individual's vocal tract and by the individual's behavioural traits. The idea is to validate a user's identity using characteristics extracted from their voice. This technique can be used for authentication, surveillance, banking by telephone, telephone shopping, database access services, information services, voice mail, security control and many other areas.
Voice recognition can be classified into identification and verification. Speaker identification is the process of finding which registered speaker provided a given utterance, whilst speaker verification is making a decision to accept or reject an identity claim. The process of speaker recognition consists of two modules, namely feature extraction and feature matching. Feature extraction is the process in which we extract from the voice signal the data that will later represent each speaker. Feature matching involves identification of the unknown speaker by comparing the features extracted from his/her voice input with those from a set of known speakers.
The proposed work consists of determining what constitutes a voice, developing code that records and recognises an individual, developing code for the analysis and identification of this individual, and finally simulating the program in MATLAB.
1.2 PROBLEM STATEMENT
Due to the long queues experienced in banks, to alarming security concerns, and to the need to improve the quality of life of disabled and elderly people in Zimbabwe, there is a need to develop a digital system that will identify individuals by voice and respond accordingly.
1.4 OBJECTIVES
The objectives of this research are to:
Determine what constitutes a voice.
Critically review literature related to voice recognition.
Develop code that records and recognises an individual.
Develop code that analyses a voice and identifies an individual's voice.
Simulate the program in MATLAB.
1.5 METHODOLOGY
Consulting the supervisor
Research on the internet
Industrial visits
Use of the library
CHAPTER 2
2.0 VOICE
2.0 INTRODUCTION
Voice recognition is finding out who is speaking based on the information contained in speech waves. Upon identification, the system then performs the respective task. There is therefore a need to understand what makes voice suitable for such a task: why is it possible to identify an individual from voice alone? This chapter looks at the aspects and features of voice, which include voice biometrics, voice production and voice qualities.
2.2 VOICE PRODUCTION
The spoken word results from three components of voice production: voiced sound, resonance and articulation.
Voiced sound: the basic sound produced by vocal fold vibration.
Articulation: the process of modifying the voiced sound. The vocal tract articulators (the tongue, soft palate and lips) produce recognisable words.
Resonance: the process of amplifying and modifying the voiced sound by use of the vocal tract resonators (the throat, mouth cavity and nasal passages). The resonators produce a person's recognisable voice.
2.3 VOICE MECHANISM
Speaking involves a voice mechanism with three subsystems. Each subsystem is composed of different parts of the body and has a specific role in voice production:
Air pressure system: diaphragm, chest muscles, ribs, abdominal muscles, lungs.
Vibratory system: voice box (larynx) and vocal folds, with the muscles and nerves that control them.
Resonating system: vocal tract, comprising the throat (pharynx), oral cavity and nasal passages.
2.4 VOICE CHARACTERISTICS
Human speech is greatly influenced by the affective state of the speaker, such as sadness, happiness, fear, anger, aggression, lack of energy or drowsiness.
Speech flow: the rate or pace at which utterances are produced, as well as the number and duration of temporary breaks in speaking.
Loudness: reflects the amount of energy associated with the articulation of utterances and, when regarded as a time-varying quantity, the speaker's dynamic expressiveness.
Intonation: the manner of producing utterances with respect to the rise and fall of pitch; it leads to tonal shifts in either direction of the speaker's mean vocal pitch.
Intensity of overtones: overtones are the higher tones which faintly accompany a fundamental tone and are responsible for the tonal diversity of sounds.
2.5 VOICE QUALITIES
Voice qualities are as distinctive as our faces: no two are exactly the same. The traits that make a voice unique can be placed into two categories: fundamental frequency (high or low) and intensity (loud or soft). Other attributes fall under vocal qualities.
If we were to write an equation for an individual's unique voice, it would be:
Voice quality = vocal tract configuration + laryngeal anatomy + learned component
The shape of an individual's vocal tract is partly genetic and partly learned. Necks are of different sizes, hence the pharynx may be narrow or wide. These attributes are genetically determined, except for configurations due to trauma and disease.
Similarly, laryngeal anatomy is partially determined at birth: the length of one's vocal folds is determined by genes. The general hydration of the vocal fold tissues and the agility of the muscles can be controlled by vocal health and training.
The learned part of the equation can be referred to as vocal habits. These include items such as rhythm and rate of speech and vowel pronunciation.
Biometric traits can be classified as behavioural or physiological.
Behavioural
1. Voice
2. Gait
3. Rhythm
Physiological
1. D.N.A
2. Palm print
3. Finger print
4. Face recognition
5. Hand geometry
6. Iris recognition
Figure 4. The two fundamental tasks of speaker recognition: identification and verification.
Chapter 3
Principles of speaker recognition.
Speaker recognition is a biometric task of validating a user's claimed identity using the characteristic features extracted from their speech samples. It can be classified as follows.
Speaker recognition can be classified along three axes:
Text dependent vs text independent
Open set vs closed set
Identification vs verification
3.3 MODULES
The two main modules of speaker recognition are feature extraction and feature matching.
Feature extraction: the first and most important part of speech recognition, as it distinguishes one speech sample from another. Feature extraction converts the speech waveform into a set of feature vectors used for analysis.
Feature matching: this process follows feature extraction and matches the stored template with the features extracted from the input speech signal.
CHAPTER 4
FEATURE EXTRACTION
4.0 INTRODUCTION
The aim of this chapter is to convert the speech waveform into a set of feature vectors for analysis. This process is often called the signal-processing front end.
The speech signal is usually represented as a quasi-stationary signal for ease of analysis. A quasi-stationary signal is a slowly time-varying signal, so over short intervals the signal behaves as though it were stationary.
Short-time spectral analysis is the most appropriate method to characterise a speech signal. This follows from the observation that, when examined over a short period of time (between 5 and 100 ms), the characteristics of the speech signal are fairly stationary, whereas over longer periods (1/5 s or more) the characteristics change.
There are a variety of possible methods for parametrically representing the speech signal for speaker recognition. These methods include: Linear Predictive Coding (LPC), Linear Predictive Cepstral Coefficients (LPCC), Mel Frequency Cepstrum Coefficients (MFCC), Cepstral Coefficients using the DCT, AMFCC, Perceptual Linear Prediction (PLP), power spectral analysis, Relative Spectra filtering of log-domain coefficients (RASTA), first-order derivatives (DELTA) and the Discrete Wavelet Transform (DWT). These methods will be explained and the best method chosen based on their advantages and disadvantages.
4.1 PRE-PROCESSING
Before feature extraction the signal has to be pre-processed, that is, passed through various stages of signal conditioning. These tasks are:
Frame blocking
Windowing
Truncation
Short-term Fourier transform
Mel frequency warping
Cepstrum
Mel Frequency Cepstral Coefficients (MFCC)
Perceptual Linear Prediction (PLP)
Power spectral analysis (FFT)
First-order derivative (DELTA)
Discrete Wavelet Transform (DWT)
Relative spectra filtering of log-domain coefficients (RASTA)
4.2.2 Windowing
Windowing follows frame blocking. It is done to reduce discontinuities at the edges of each frame: the idea is to taper the signal at the start and end of each frame. Given a windowing function w(n), 0 ≤ n ≤ N-1, where N is the total number of samples in each frame, the resulting signal is:
y(n) = x(n)w(n), 0 ≤ n ≤ N-1
In general, Hamming windows are used; they have the form:
w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1
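The framing and windowing steps can be sketched as follows. The project itself is implemented in MATLAB; this is an illustrative Python/NumPy version, where the sampling rate, frame length, frame shift and the test tone are assumed values:

```python
import numpy as np

fs = 8000                        # assumed sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)    # 100 ms test tone standing in for speech
x = np.sin(2 * np.pi * 440 * t)

N = 256                          # frame length (32 ms at 8 kHz)
M = 100                          # frame shift (overlap of N - M samples)

# Frame blocking: slice the signal into overlapping frames
n_frames = 1 + (len(x) - N) // M
frames = np.stack([x[i * M : i * M + N] for i in range(n_frames)])

# Hamming window: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

windowed = frames * w            # y(n) = x(n) * w(n), applied per frame
```

Multiplying each frame by w(n) tapers the frame edges towards zero, which reduces the spectral leakage that abrupt frame boundaries would otherwise cause.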
4.2.3 Truncation
The default sampling frequency assumed for the wavread command is 44100 Hz. Recording an audio clip for two seconds would therefore give nearly ninety thousand samples, which would be a lot to handle. There is therefore a need to truncate the signal by selecting a particular threshold value while traversing the time axis in the positive direction. To find the end point, the same algorithm is repeated in the negative direction.
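This threshold-based trimming of leading and trailing silence can be sketched as follows (a Python/NumPy illustration on a synthetic clip; the 0.01 threshold is an assumed value):

```python
import numpy as np

# synthetic clip: leading silence, a 0.5-amplitude burst, trailing silence
sig = np.concatenate([np.zeros(1000), 0.5 * np.ones(500), np.zeros(800)])

threshold = 0.01                          # assumed silence threshold
above = np.nonzero(np.abs(sig) > threshold)[0]
start, end = above[0], above[-1]          # first/last sample above threshold
trimmed = sig[start:end + 1]              # keep only the active region
```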
4.2.4 SHORT TERM FAST FOURIER TRANSFORM
Each windowed frame is converted into its frequency-domain representation by the discrete Fourier transform:
X(k) = Σ_{n=0}^{N-1} x(n) e^{-j2πkn/N}, k = 0, 1, 2, ..., N-1
where the X(k) are complex numbers and we take only their absolute values. The resulting sequence {X(k)} is interpreted as follows: the bins k = 0, ..., N/2-1 correspond to the positive frequencies 0 ≤ f < Fs/2, and the remaining bins mirror them at the negative frequencies.
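As an illustration (Python/NumPy; the tone frequency, frame size and sampling rate are assumed values), we can take the magnitude of the DFT of one frame and read off the dominant positive-frequency bin:

```python
import numpy as np

N, fs = 256, 8000
n = np.arange(N)
frame = np.sin(2 * np.pi * 1000 * n / fs)  # 1 kHz tone standing in for a frame

X = np.fft.fft(frame, N)        # complex DFT values X(k)
mag = np.abs(X)                 # keep only the absolute values
half = mag[:N // 2]             # positive-frequency bins k = 0 .. N/2 - 1

peak_hz = np.argmax(half) * fs / N   # bin index converted back to hertz
```

For a real input the magnitude spectrum is symmetric, which is why only the first N/2 bins need to be kept.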
4.2.4.1 Disadvantages of FFT
The Fourier transform is not suitable for the analysis of non-stationary signals, because it provides only the frequency information of a signal and gives no information about the time at which each frequency is present.
The mel scale uses a band-pass filter bank that is spaced uniformly on the mel scale, as shown in the figure above. The spacing and bandwidth of each band-pass filter are determined by a constant mel frequency interval. Each filter has a triangular frequency response, and the number of mel spectrum coefficients, K, is typically chosen to be 20. Note that this filter bank is applied in the frequency domain, so it simply amounts to applying triangular windows to the spectrum. Each filter of a mel-warping filter bank can be viewed as a histogram bin in the frequency domain, where adjacent bins overlap.
4.2.6 CEPSTRUM
The name cepstrum comes from spectrum, obtained by reversing the first four letters of spectrum. By definition, the cepstrum is the result of taking the inverse Fourier transform of the logarithm of the spectrum of a signal. There are different types of cepstra, among them the complex cepstrum, real cepstrum, power cepstrum and phase cepstrum.
The real cepstrum applies the logarithm to the magnitude of the spectrum only, so it uses just the magnitude information.
The complex cepstrum applies the complex logarithm, so it keeps the information about both the magnitude and the phase of the initial spectrum.
The complex cepstrum can be expressed as:
cepstrum of signal = IFT(log(FT(signal)) + j2πm)
where m is the integer required to unwrap the phase angle.
Algorithmically: signal → FT → log → phase unwrapping → IFT → cepstrum.
The mel frequency cepstral coefficients are obtained by taking the discrete cosine transform of the log mel spectrum:
c_n = Σ_{k=1}^{K} (log S_k) cos[n(k - 1/2)π/K], n = 1, 2, ..., K
where S_k, k = 1, ..., K, are the mel spectrum coefficients.
Signal → Fourier transform → logarithm → inverse Fourier transform → cepstrum
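The real cepstrum follows directly from this definition; a Python/NumPy sketch (the test signal is synthetic, and a small floor is added before the log to avoid taking the log of zero):

```python
import numpy as np

fs = 8000
n = np.arange(512)
# synthetic two-tone frame standing in for a speech segment
x = np.sin(2 * np.pi * 400 * n / fs) + 0.5 * np.sin(2 * np.pi * 800 * n / fs)

spectrum = np.fft.fft(x)
log_mag = np.log(np.abs(spectrum) + 1e-10)       # log magnitude (real cepstrum)
real_cepstrum = np.real(np.fft.ifft(log_mag))    # inverse FT of the log spectrum
```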
Linear predictive analysis models the speech signal with a linear predictor; the residual error is the part of the signal the predictor cannot account for. The autocorrelation of a windowed speech segment is given by:
R(k) = Σ_{n=0}^{Nw-1-k} s_w(n) s_w(n+k)
where
Nw is the length of the window;
s_w is the windowed segment.
The diagram below illustrates the process of LPCC.
Input speech → pre-emphasis → frame blocking → Hamming window → auto-correlation analysis → LPC analysis → cepstrum analysis → LPCC
4.2.8.1 Advantages of LPCC
The resources required are low.
It is easy to implement.
It is highly popular.
4.2.8.2 Disadvantages of LPCC
Number of speakers: single speaker only.
Number of languages: single language.
Vocabulary size: small to moderate, below the three-hundred-word mark.
Not accurate for long sentences or words.
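The LPC coefficients at the heart of this analysis are usually obtained from the autocorrelation sequence via the Levinson-Durbin recursion. A self-contained Python/NumPy sketch, tested on a synthetic second-order autoregressive signal (the AR coefficients 0.9 and -0.5 are assumed test values, not anything from the project):

```python
import numpy as np

def lpc(x, p):
    """Order-p linear prediction coefficients via Levinson-Durbin."""
    N = len(x)
    # autocorrelation r[0..p]
    r = np.array([np.dot(x[:N - k], x[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)
    a[0] = 1.0
    e = r[0]                                  # prediction error energy
    for i in range(1, p + 1):
        # reflection coefficient for order i
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]   # update lower-order coefficients
        a[i] = k
        e *= (1 - k * k)                      # reduce residual error
    return a, e

# AR(2) test signal: x[n] = 0.9 x[n-1] - 0.5 x[n-2] + noise
rng = np.random.default_rng(0)
w = rng.standard_normal(4000)
x = np.zeros(4000)
for n in range(2, 4000):
    x[n] = 0.9 * x[n - 1] - 0.5 * x[n - 2] + w[n]

a, e = lpc(x, 2)   # expect a approximately [1, -0.9, 0.5]
```

With this sign convention A(z) = 1 + a_1 z^-1 + a_2 z^-2, so the estimated coefficients come out near -0.9 and +0.5 for this test signal.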
Figure 19. Equal loudness and intensity loudness curves.
The main perceptual aspects are the critical band resolution curve, the intensity-loudness power law and the equal loudness curve; all of these are approximated by PLP.
The power spectrum is given by:
P(ω) = Re[S(ω)]² + Im[S(ω)]²
The Bark frequency is given by:
Ω(ω) = 6 ln{ω/(1200π) + [(ω/(1200π))² + 1]^0.5}
The above is achieved by applying a frequency warping onto the Bark scale, the first step being conversion from frequency to Bark, which better represents the human hearing resolution in frequency. The power spectrum is then convolved with the simulated critical band masking curve to give the auditory warped spectrum; this simulates the critical band integration of human hearing. The Bark filter bank thus integrates frequency warping, smoothing and sampling. After the Bark filter comes the equal loudness pre-emphasis weighting, which weights the filter bank outputs to simulate the sensitivity of hearing. The equalised values are then transformed by the intensity-loudness power law, achieved by raising each value to the power 0.33. The resulting auditory warped spectrum is further processed by linear prediction (LP).
Input speech → pre-emphasis → framing → windowing → DFT → critical band (Bark) analysis → log → RASTA filtering → inverse log → equal loudness → intensity-loudness power law → auto-regressive modelling → cepstral domain transform → RASTA-PLP
Speech signal → pre-emphasis → framing → windowing → FFT → log → DCT → mel cepstrum
Figure 21. MFCC derivation.
4.3.1.0 Advantages of MFCC
Resources required are low to moderate.
MFCC approximates the human auditory response more closely, as a result of positioning the frequency bands logarithmically.
Its popularity is high.
Ease of implementation: easy to moderate.
Number of speakers: multi-speaker.
Vocabulary size: moderate to large.
4.3.0.0 Disadvantages of MFCC
MFCC values are not very robust in the presence of additive noise, hence it is the norm to normalise their values in speech recognition systems to reduce the influence of noise.
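The final DCT step of the MFCC derivation (the c_n = Σ (log S_k) cos[n(k - 1/2)π/K] formula given earlier) can be sketched as follows; the mel spectrum here is a random stand-in for one frame's filter-bank output:

```python
import numpy as np

K = 20                               # number of mel filter-bank channels
rng = np.random.default_rng(1)
S = rng.uniform(0.1, 1.0, K)         # stand-in mel spectrum for one frame

n_coeffs = 12                        # keep the first 12 cepstral coefficients
k = np.arange(1, K + 1)
mfcc = np.array([np.sum(np.log(S) * np.cos(n * (k - 0.5) * np.pi / K))
                 for n in range(1, n_coeffs + 1)])
```

Keeping only the first dozen coefficients is a common choice: the low-order coefficients capture the smooth spectral envelope that characterises the speaker.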
4.3.1 DWT
Speech is a non-stationary signal with short high-frequency bursts and long quasi-stationary elements. As a consequence the Fourier transform is not the most suitable method for analysis, since it provides only the frequency components of a signal and gives no information about the time at which each frequency is present. This leaves the Discrete Wavelet Transform as the most appropriate method: it provides a flexible time-frequency window and deals much better with non-stationary signals.
The wavelet transform decomposes signals over translated and dilated mother wavelets. Mother wavelets are time functions with fast decay and finite energy. The many translated and dilated versions of a single wavelet are orthogonal to each other.
The continuous wavelet transform is given by:
CWT(a, b) = (1/√a) ∫ x(t) ψ*((t - b)/a) dt
The scaling function and the wavelet function satisfy the two-scale relations:
φ(t) = Σ_n h[n] √2 φ(2t - n)
ψ(t) = Σ_n g[n] √2 φ(2t - n)
where
φ(t) is the scaling function;
ψ(t) is the wavelet function;
h[n] is the impulse response of a low-pass filter;
g[n] is the impulse response of a high-pass filter.
The wavelet and scaling functions are implemented using a pair of filters h[n] and g[n]. The two filters are quadrature mirror filters and satisfy the property g[n] = (-1)^(n-1) h[1-n]. The input signal is low-pass filtered to give the approximation components and high-pass filtered to give the detail components of the input speech.
Next is dyadic decomposition: the approximation signal is again low-pass and high-pass filtered, which yields the next level of approximation and detail components. Dyadic decomposition separates the input signal bandwidth into a logarithmic set of bandwidths, whereas uniform decomposition would divide it into sets of uniform bandwidth.
The DWT resolves frequencies in exactly the way speech requires: high frequencies appear briefly at the beginning of a sound, while lower frequencies are present afterwards for quite a long period. The discrete wavelet parameters carry information about different frequency scales, which is useful as they provide speech information for the corresponding frequency bands.
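One level of this filter-and-downsample decomposition can be sketched with the Haar filter pair (here the QMF relation is written as g[n] = (-1)^n h[1-n], one common sign convention; the input vector is an arbitrary example):

```python
import numpy as np

# Haar QMF pair: h = low-pass, g = high-pass, with g[n] = (-1)^n h[1-n]
h = np.array([1.0, 1.0]) / np.sqrt(2)
g = np.array([1.0, -1.0]) / np.sqrt(2)

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])

# One level of DWT: filter, then downsample by 2
approx = np.convolve(x, h)[1::2]   # low-pass  -> approximation coefficients
detail = np.convolve(x, g)[1::2]   # high-pass -> detail coefficients
```

Because the Haar transform is orthonormal, the energy of the signal is preserved across the approximation and detail bands; repeating the split on `approx` gives the dyadic decomposition described above.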
Speech signal → pre-processing → framing → windowing → DWT sub-bands → LPC on each sub-band → concatenation → DWLP features
Advantages of DWT
1. There is simultaneous localisation in both the time and frequency domains.
2. DWT gives better recognition accuracy than MFCC and LPC.
3. It has better energy compaction, hence it has been used for feature extraction in place of mel-filtered sub-band energies.
4. DWT has better time resolution than the Fourier transform.
5. A wavelet transform can be used to break a signal down into component wavelets.
6. It is computationally very fast.
7. DWT can model the details of unvoiced sound portions effectively.
8. The wavelet transform is able to separate the fine details in a signal.
Disadvantages of DWT
The continuous wavelet transform contains high redundancy when analysing signals.
The cost of computing the DWT may be higher than that needed for the DCT and other methods.
It needs a longer compression time.
SUMMARY
By implementing the above procedures, a set of Mel Frequency Cepstrum Coefficients (MFCC) is computed for each frame, so that each input speech utterance is transformed into a sequence of acoustic vectors.
CHAPTER 5
5.0 INTRODUCTION
Spectral analysis refers to the techniques of estimating the power of the frequency components of a signal. Many naturally occurring phenomena are oscillatory in nature and have frequency dependency, for example speech signals and weather patterns.
Speaker recognition is made simpler by the branch of engineering that deals with pattern recognition. The main purpose of pattern recognition is to intelligently assign the objects of interest to a number of classes. From the input speech we extract the acoustic vectors, which form the patterns (objects), and the classes of interest are the speakers. Since a classification procedure is applied to the extracted features, this stage can also be termed feature matching.
In addition, if we have a set of patterns whose classes are already known, the problem reduces to supervised pattern recognition. These patterns make up the training set, and the classification algorithm is derived from them. The remaining patterns comprise the test set.
This chapter looks at the methods that can be used to carry out feature matching, namely the MFCC approach, the FFT approach, Vector Quantization, DTW, GMM and HMM. All these methods will be explained, and a final decision reached as to which approach to use, based on the advantages and disadvantages of each.
Vector Quantization (VQ)
Fast Fourier Transform (FFT)
Dynamic Time Warping (DTW)
Gaussian Mixture Models (GMM)
Hidden Markov Models (HMM)
A codebook is used because it is impossible to represent every single feature vector in the space generated from the training utterances of the corresponding speaker.
In the figure below, two speakers and two dimensions are represented. The circles show the acoustic vectors for speaker one and the triangles those for speaker two. During the training phase, a speaker-specific VQ codebook is built for each known speaker by clustering their training acoustic vectors. The resulting codewords are shown as black circles and black triangles for speakers one and two respectively. The recognition phase then consists of vector-quantizing an input utterance using each trained codebook. To identify the speaker of the input utterance, the total VQ distortion is calculated for each codebook, and the speaker with the smallest total distortion is the one identified.
The clustering is done by an algorithm called LBG [Linde, Buzo and Gray, 1980]. The LBG algorithm clusters a set of L training vectors into M codebook vectors.
Figure 5.3
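A minimal sketch of the LBG binary-splitting procedure (Python/NumPy; the perturbation ε, the iteration count and the two synthetic clusters are assumed values, and the multiplicative split assumes non-zero centroids):

```python
import numpy as np

def lbg(train, M, eps=0.01, n_iter=20):
    """Cluster training vectors into an M-entry codebook by LBG splitting.
    train: (L, d) array of training vectors; M: codebook size (power of 2)."""
    codebook = train.mean(axis=0, keepdims=True)      # start from one centroid
    while len(codebook) < M:
        # split each codeword into two slightly perturbed copies
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            # assign each training vector to its nearest codeword
            d = np.linalg.norm(train[:, None, :] - codebook[None, :, :], axis=2)
            idx = d.argmin(axis=1)
            # move each codeword to the centroid of its cluster
            for j in range(len(codebook)):
                if np.any(idx == j):
                    codebook[j] = train[idx == j].mean(axis=0)
    return codebook

rng = np.random.default_rng(2)
# two well-separated clusters standing in for two speakers' acoustic vectors
train = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
                   rng.normal(5.0, 0.1, (50, 2))])
cb = lbg(train, 2)
```

The nearest-codeword assignment and centroid update are the same steps used at recognition time to compute a speaker's total VQ distortion.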
Figure 24
Disadvantages of the FFT approach:
It produces complex output from real-valued input, which is not trivial to deal with.
With the FFT we would need to compute both the magnitude and the phase information.
We also have to handle both the negative and the positive frequency components.
Signal power information is unavailable.
The DTW distance between two sequences of lengths n and m is:
DTW(X, Y) = min (1/K) Σ_{k=1}^{K} d(x_{i(k)}, y_{j(k)})
where the minimum is taken over all warping paths of length K, with max(n, m) ≤ K ≤ n + m + 1. The K in the denominator compensates for the fact that warping paths may have different lengths. The optimal path is found efficiently using dynamic programming.
Dynamic Programming
Dynamic programming determines the optimal path providing the best warp between a given speaker template and a test utterance. A cumulative distance matrix is generated from the Euclidean distances already obtained.
The process of generating the cumulative matrix is as follows:
1. Use the local path constraint, in which three different paths (P1, P2 and P3) lead into each grid point; the best path is the one which accumulates the least Euclidean distance.
2. Take the initial condition g(1,1) = d(1,1), where d and g are the Euclidean distance matrix and the cumulative distance matrix respectively.
3. Fill one edge of the grid: g(i,1) = g(i-1,1) + d(i,1).
4. Fill the other edge: g(1,j) = g(1,j-1) + d(1,j).
5. Proceed to the next line: g(i,2) = min(g(i,1), g(i-1,1), g(i-1,2)) + d(i,2).
6. Continue from left to right and from bottom to top with the rest of the grid: g(i,j) = min(g(i,j-1), g(i-1,j-1), g(i-1,j)) + d(i,j).
7. Read off the value of g(n,m).
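The grid recursion in steps 2-7 can be sketched as follows (Python/NumPy, using 1-D feature sequences and the absolute difference as the local distance d(i,j); 0-based indices replace the 1-based ones above):

```python
import numpy as np

def dtw_distance(X, Y):
    """Cumulative-distance DTW between two 1-D feature sequences."""
    n, m = len(X), len(Y)
    d = np.abs(np.subtract.outer(X, Y))    # local distances d(i, j)
    g = np.zeros((n, m))
    g[0, 0] = d[0, 0]                      # initial condition
    for i in range(1, n):                  # one edge of the grid
        g[i, 0] = g[i - 1, 0] + d[i, 0]
    for j in range(1, m):                  # the other edge
        g[0, j] = g[0, j - 1] + d[0, j]
    for i in range(1, n):                  # the rest of the grid
        for j in range(1, m):
            g[i, j] = min(g[i - 1, j], g[i - 1, j - 1], g[i, j - 1]) + d[i, j]
    return g[n - 1, m - 1]                 # the value g(n, m)

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 2.0, 2.0, 3.0, 4.0])
dist = dtw_distance(a, b)   # zero: b is just a warped copy of a
```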
5.4.1 ADVANTAGES OF DTW
It is simple.
It is efficient.
It is fast.
It needs only simple hardware.
5.4.1.1 DISADVANTAGES OF DTW
It does not take into account the vocal tract information of a particular user.
5.4.1.2 Speaker identification
Consider a set of speakers with known feature vectors. When asked to identify an unknown speaker, assuming the speaker is one whose voice sample we already have, the first step is feature extraction. Once we obtain the feature vectors, we warp the unknown vectors with respect to each reference speaker in turn. We follow the dynamic programming procedure and calculate g(n,m). This procedure is repeated for all available speakers, and the smallest value of g(n,m) gives the identity.
5.5.1.0 HMM
An HMM describes a two-stage stochastic process. A Markov chain forms the first stage, and in the second stage an output is emitted for every point in time t. An HMM for discrete symbols is characterised by the following elements:
N, the number of hidden states in the model. We label the states as {1, 2, ..., N} and denote the state at time t as q_t.
M, the number of distinct observation symbols per state, V = {v_1, v_2, ..., v_M}.
The state transition probability distribution A = {a_ij}, where a_ij = P[q_{t+1} = j | q_t = i].
The observation probability distribution in state j, B = {b_j(k)}, where b_j(k) = P[o_t = v_k | q_t = j], 1 ≤ k ≤ M.
The initial state distribution π = {π_i}, where π_i = P[q_1 = i], 1 ≤ i ≤ N.
Training the HMMs
Each speaker s in the database must have a corresponding HMM λ_s, whose model parameters λ = (A, B, π) maximise the likelihood of the training data.
The following diagram illustrates the steps for estimating the model parameters.
Speech signal → feature extraction (MFCC) → vector quantizer → observation sequence → forward-backward algorithm → HMM λ = (A, B, π)
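Once a model λ = (A, B, π) is trained, the likelihood of an observation sequence under that model is evaluated with the forward part of the forward-backward algorithm. A toy Python/NumPy sketch with assumed parameter values:

```python
import numpy as np

# Toy discrete HMM with N = 2 states and M = 2 observation symbols
A = np.array([[0.7, 0.3],       # a_ij = P[q_{t+1} = j | q_t = i]
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],       # b_j(k) = P[o_t = v_k | q_t = j]
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])       # initial state distribution

O = [0, 1, 0]                   # observation sequence (symbol indices)

# Forward algorithm: alpha[t, j] = P(o_1..o_t, q_t = j | lambda)
alpha = np.zeros((len(O), 2))
alpha[0] = pi * B[:, O[0]]
for t in range(1, len(O)):
    alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]

likelihood = alpha[-1].sum()    # P(O | lambda)
```

Speaker identification then amounts to computing this likelihood for every speaker's model and picking the largest.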
5.6 INTRODUCTION
The Gaussian Mixture Model is a parametric method for speaker recognition. After clustering, feature vectors in d dimensions resemble Gaussian distributions; this implies that each cluster can be seen as a probability distribution, and the features belonging to a cluster are best represented by their probability values. Speech analysis (feature extraction) is done first; then the Gaussian mixture speaker model and its parameters are analysed. The use of Gaussian mixture densities for speaker recognition is motivated by two facts.
5.7 SPEECH ANALYSIS
Linear predictive, cepstral, and reflection coefficients have been employed most frequently in
speaker recognition systems. However, these degrade in the presence of noise. Recent studies
have pointed out that directly computed filter-bank features are more robust for noisy speech
recognition. The magnitude spectrum of a quasi-stationary frame is therefore first
pre-emphasised and processed by a mel-scale filter bank. The log filter-bank energies are
then cosine-transformed to give the cepstral coefficients.
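The final cosine-transform step can be sketched as a DCT-II over the log filter-bank energies. The function name and the energy values below are purely illustrative:

```python
import math

def dct_cepstrum(log_energies, n_coeffs):
    """Cepstral coefficients c_n = sum_k E_k * cos(n * (k + 0.5) * pi / K),
    where E_k are the K log filter-bank energies."""
    K = len(log_energies)
    return [sum(e * math.cos(n * (k + 0.5) * math.pi / K)
                for k, e in enumerate(log_energies))
            for n in range(n_coeffs)]
```

Note how a flat (constant) energy spectrum puts all of its weight into the zeroth coefficient, while the higher coefficients capture the spectral envelope's shape.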
5.8 MODEL DESCRIPTION
A Gaussian mixture density is given by

p(x | λ) = Σ_{i=1}^{M} p_i b_i(x)

where x is a D-dimensional feature vector, p_i is the mixture weight of the i-th component,
and each component density b_i is a D-variate Gaussian

b_i(x) = 1 / ((2π)^{D/2} |Σ_i|^{1/2}) · exp{ −(1/2) (x − μ_i)′ Σ_i^{−1} (x − μ_i) }

with mean vector μ_i and covariance matrix Σ_i. The mixture weights satisfy

Σ_{i=1}^{M} p_i = 1.

We represent the Gaussian mixture density by the mixture weights, mean vectors, and
covariance matrices of all component densities. Collectively these parameters are written as

λ = {p_i, μ_i, Σ_i},  i = 1, ..., M.
The parameters are re-estimated iteratively. On each iteration:

Means:
μ̄_i = Σ_{t=1}^{T} Pr(i | x_t, λ) x_t / Σ_{t=1}^{T} Pr(i | x_t, λ)

Mixture weights:
p̄_i = (1/T) Σ_{t=1}^{T} Pr(i | x_t, λ)

Variances:
σ̄_i² = Σ_{t=1}^{T} Pr(i | x_t, λ) x_t² / Σ_{t=1}^{T} Pr(i | x_t, λ) − μ̄_i²

where the a posteriori probability of component i is

Pr(i | x_t, λ) = p_i b_i(x_t) / Σ_{k=1}^{M} p_k b_k(x_t).
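These re-estimation formulas are the EM updates for a Gaussian mixture. A minimal sketch of one EM iteration in one dimension, with variable names (weights p, means mu, variances var) mirroring the text; this is an illustration, not the project implementation:

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density b_i(x)."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_step(data, p, mu, var):
    """One re-estimation pass over T samples for an M-component mixture."""
    M, T = len(p), len(data)
    # E-step: posterior Pr(i | x_t, lambda) for every sample and component
    post = []
    for x in data:
        num = [p[i] * gauss(x, mu[i], var[i]) for i in range(M)]
        s = sum(num)
        post.append([n / s for n in num])
    # M-step: re-estimate weights, means, and variances
    n_i = [sum(post[t][i] for t in range(T)) for i in range(M)]
    p_new = [n_i[i] / T for i in range(M)]
    mu_new = [sum(post[t][i] * data[t] for t in range(T)) / n_i[i]
              for i in range(M)]
    var_new = [sum(post[t][i] * data[t] ** 2 for t in range(T)) / n_i[i]
               - mu_new[i] ** 2 for i in range(M)]
    return p_new, mu_new, var_new
```

Repeating `em_step` until the likelihood stops improving yields the maximum-likelihood model λ.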
5.9 SPEAKER IDENTIFICATION
After modelling each user, a set of Gaussian mixture speaker models is obtained; each
individual model represents the Gaussian distribution components present.
Given S speakers, the set of models is represented by {λ_1, λ_2, λ_3, ..., λ_S}.
We finally select the speaker model with the maximum a posteriori probability for a given test
utterance X. Mathematically,

Ŝ = arg max_{1≤k≤S} Pr(λ_k | X) = arg max_{1≤k≤S} p(X | λ_k) Pr(λ_k) / p(X).

Assuming equal prior probabilities for all speakers, this reduces to maximising the
log-likelihood, Ŝ = arg max_{1≤k≤S} Σ_{t=1}^{T} log p(x_t | λ_k).
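The identification rule can be sketched directly. Here each "model" is reduced to a density function returning p(x | λ_k); the names are illustrative, and equal priors are assumed:

```python
import math

def identify_speaker(X, models):
    """Pick the model maximising the log-likelihood of utterance X.
    models : dict mapping speaker name -> density function p(x | lambda_k)
    """
    def log_lik(density):
        return sum(math.log(density(x)) for x in X)
    return max(models, key=lambda name: log_lik(models[name]))
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long utterances, which is why the log form of the rule is the one used in practice.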
5.9.1 ADVANTAGES OF GMM
5.9.2 DISADVANTAGES OF GMM
Efficiency degrades when the number of mixture components becomes large.
CHAPTER 6
6.0 RESULTS
In this project, feature extraction was done with Mel Frequency Cepstral Coefficients
(MFCC). Each speaker was then modelled using vector quantization (VQ). A VQ codebook
was generated by clustering the feature vectors of each speaker and stored in the speaker
database. The LBG algorithm was used for the clustering, and VQ-based clustering gave a
faster speaker identification process than the other methods considered.
Eight speech signals corresponding to eight speakers were stored in the train folder:
s1.wav, s2.wav, s3.wav, s4.wav, s5.wav, s6.wav, s7.wav and s8.wav. These were compared
with the sound files in the test folder, and the following results were obtained.
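The LBG codebook training mentioned above can be sketched as a split-and-refine loop: start from one centroid, split each centroid into a perturbed pair, and refine with k-means until the codebook reaches the desired size. A hedged, one-dimensional illustration (not the project's MATLAB code):

```python
def nearest(x, codebook):
    """Index of the codeword closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(x - codebook[i]))

def lbg(data, size, eps=0.01, iters=20):
    """Train a VQ codebook of the given size on 1-D feature values."""
    codebook = [sum(data) / len(data)]           # single centroid to start
    while len(codebook) < size:
        # Split every centroid into a +eps / -eps pair
        codebook = [c * (1 + s) for c in codebook for s in (eps, -eps)]
        for _ in range(iters):                   # k-means refinement
            cells = [[] for _ in codebook]
            for x in data:
                cells[nearest(x, codebook)].append(x)
            codebook = [sum(cell) / len(cell) if cell else c
                        for cell, c in zip(cells, codebook)]
    return codebook
```

In the real system each codeword is a full MFCC vector and the distance is Euclidean in that space; the 1-D version keeps the structure of the algorithm visible.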
Figure 29: Plot of the Euclidean distance between speaker 1 and all speakers.
Figure 30: Plot of the Euclidean distances between speaker 2 and all speakers.
Figure 31: Plot of the Euclidean distance between speaker 3 and all speakers.
Figure 32: Plot of the Euclidean distance between speaker 4 and all speakers.
Figure 33: Plot of the Euclidean distance between speaker 5 and all speakers.
Figure 35: Plot of the Euclidean distance between speaker 8 and all speakers.
The matching loop of the test routine is shown below. The surrounding loop over test
utterances and the distmin initialisation, lost in the page break, have been restored; the
names test_signals, fs and mfcc are illustrative placeholders for the actual variables.

for k = 1:length(test_signals)          % each test utterance
    v = mfcc(test_signals{k}, fs);      % extract MFCC feature vectors
    distmin = inf;                      % smallest average distortion so far
    k1 = 0;                             % index of the best-matching speaker
    for l = 1:length(code)              % compare against each VQ codebook
        d = disteu(v, code{l});         % Euclidean distances to codewords
        dist = sum(min(d, [], 2)) / size(d, 1);  % average quantization distortion
        if dist < distmin
            distmin = dist;
            k1 = l;
        end
    end
    msg = sprintf('Speaker %d matches with speaker %d', k, k1);
    disp(msg)
end
6.4 CHALLENGES FACED
6.5 CONCLUSION
The objective of this project was to build a speaker recognition system, and this was
achieved, as seen in the simulation: eight different voice clips were successfully matched to
their respective speakers. This was possible through feature extraction with Mel Frequency
Cepstral Coefficients (MFCC). Each speaker was modelled using vector quantization, with a
VQ codebook generated by clustering the training feature vectors of each speaker. When
matching an unknown speaker, the minimal Euclidean distance was used to recognise the
speaker.
The vector quantization method was faster than the other methods considered, and it is also
efficient.
6.7 RECOMMENDATIONS
The project focused on building a text-dependent, closed-set voice recognition system, as
only one word was uttered. As explained before, a closed set is one with a limited number of
speakers. This could be improved in future by using statistical models such as Gaussian
Mixture Models (GMM) and Hidden Markov Models (HMM); learning models such as
neural networks and other aspects of artificial intelligence could also be implemented to
improve the project.
This would also make the project less prone to noise and would cater for different accents
and moods. The following areas can also be looked into:
The time taken to compute the VQ codebook
The size of the training data
The detection method used
REFERENCES