
#AC"!T$ O# E!ECTRICA! AND E!

ECTRONIC ENGINEERING

ADSP ASSIGNMENT (MEE 10403)

SPEECH RECOGNITION

Code of Course: MEE 10403
Name of Course: ADSP
Name 1: DINESHWARAN GUNALAN (HE120104)
Name 2: TUAN MUSTAQIM (HE100092)
Lecturer: PROF. MADYA DR. FAIZ ABDULLAH
Date: 13/04/2013

Abstract
Speaker recognition is the method of automatically identifying who is speaking on the basis of individual information embedded in speech waves, and it is broadly divided into two tasks: speaker verification and speaker identification. Speaker recognition is widely applicable wherever a speaker's voice is used to verify identity and control access to services such as banking by telephone, database access services, voice dialling, telephone shopping, information services, voice mail, security control for confidential information areas, and remote access to computers. AT&T and TI, together with Sprint, have started field tests and actual applications of speaker recognition technology; Sprint's Voice Phone Card is already being used by many customers. Speaker recognition technology is one of the most promising technologies for creating new services that will make our everyday lives more secure. Another important application of speaker recognition technology is forensics. Speaker recognition has been an appealing research field for the last few decades, and it still presents a number of unsolved problems.

Aim
The main aim of this project is speaker identification, which consists of capturing the voice of an individual and then playing back the voice after filtering out the noise.


CHAPTER 1 INTRODUCTION
1.1 Project Overview

The figure below shows the fundamental structure of speaker identification and verification systems. Speaker identification is the process of determining which registered speaker provides a given utterance; speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. In most applications where voice is used as the key to confirm the identity of a speaker, the task is classified as speaker verification.


The structure of the speaker identification and verification system above can also be modified to include the open-set identification case, in which a reference model for an unknown speaker may not exist. This is usually the case in forensic applications. In these circumstances an added decision alternative is required, namely that the unknown speaker does not match any of the models. In both verification and identification, a further threshold examination can be used to decide whether the match is close enough to accept the decision or whether more speech data are needed.

Speaker recognition methods can also be divided into text-dependent and text-independent methods. In a text-dependent method the speaker has to say key words or sentences with the same text for both training and recognition trials, whereas a text-independent method does not rely on a specific text being spoken. Formerly, text-dependent methods were widely applied, but text-independent methods are now in use. Both text-dependent and text-independent methods share a problem, however: the system can easily be deceived by playing back the recorded voice of a registered speaker. Different techniques are used to cope with such problems; for example, a small set of words or digits is used as input, and each user is prompted to speak a specified sequence of key words that is randomly selected every time the system is used. Even this method is not completely reliable, as it can be deceived by a highly developed electronic recording system that can replay the secret key words in the requested order.

MATLAB was used for this project: the training of the data takes place through the system that was set up inside this software. The files of the types below provide Utility, Mel Spectral Coefficients, Display feature vectors, and Pattern Matching using the Dynamic Programming algorithm functions inside MATLAB that can be used to simulate and process the set of data that comes from the speech recorded by the system:


Utility Functions

i) melfiltermatrix.m: Compute the matrix of mel filter coefficients given the sampling frequency, the length of the FFT and the number of desired mel filter channels.

ii) mel2freq.m: Convert a frequency from mel scale to linear scale.

iii) freq2mel.m: Convert a frequency from linear scale to mel scale.

iv) loglimit.m: Returns log(x), or log(limit) if x is below limit.

Mel Spectral Coefficients

i) computeSpectrum.m: Compute the log power spectrum given the desired FFT length, the time shift of the analysis window and the speech signal.

ii) computeMelSpectrum.m: Compute the log mel spectral coefficients given the matrix of mel filter coefficients, the time shift of the analysis window and the time signal.
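A sketch of how such a routine might be organised is shown below. This is our assumption of the behaviour described above, not the course implementation; x is a column vector of speech samples and melmat is the matrix produced by melfiltermatrix.m.

```matlab
% Sketch of a computeMelSpectrum-style routine (assumed behaviour):
% frame the signal, take the power spectrum of each windowed frame,
% then apply the precomputed mel filter matrix and the logarithm.
function melspec = mel_spectrum_sketch(x, nfft, shift, melmat)
    win  = hamming(nfft);                          % analysis window
    nfrm = floor((length(x) - nfft) / shift) + 1;
    melspec = zeros(size(melmat, 1), nfrm);
    for k = 1:nfrm
        seg = x((k-1)*shift + (1:nfft).') .* win;  % windowed frame
        P   = abs(fft(seg)).^2;                    % power spectrum
        P   = P(1:nfft/2 + 1);                     % non-negative frequencies
        melspec(:, k) = log(melmat * P + eps);     % mel-warped log energies
    end
end
```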

Display feature vectors


i) featuredisp.m / featuredisp.fig: Using these functions, record a few seconds of speech and display the resulting sequences of feature vectors (power spectrum, signal energy and mel spectrum).

Pattern Matching using the Dynamic Programming algorithm

i) dpfasym.m: Perform the dynamic programming algorithm to compute the distance between the vector sequences X and Y. Returns the overall distance, the complete distance matrix and a matrix containing the optimum path, i.e. the "matching function".

ii) dptest.m / dptest.fig: Record two speech signals, compute the mel spectral vector sequences X and Y, and match them using the dpfasym function. Display the patterns and the matching function.
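Since the course files themselves are not reproduced in this report, the following is only an illustrative sketch of the kind of dynamic-programming (DTW) distance that dpfasym.m is described as computing; the real function also returns the distance matrix and the optimum path.

```matlab
% Illustrative DTW-style distance between two feature-vector sequences
% X (d x m) and Y (d x n), where columns are frames. An assumption of
% how dpfasym.m might work, not the actual course implementation.
function D = dtw_distance_sketch(X, Y)
    m = size(X, 2); n = size(Y, 2);
    d = zeros(m, n);                          % local Euclidean distances
    for i = 1:m
        for j = 1:n
            d(i, j) = norm(X(:, i) - Y(:, j));
        end
    end
    g = inf(m, n); g(1, 1) = d(1, 1);         % accumulated cost matrix
    for i = 1:m
        for j = 1:n
            if i == 1 && j == 1, continue; end
            best = inf;
            if i > 1,          best = min(best, g(i-1, j));   end
            if j > 1,          best = min(best, g(i, j-1));   end
            if i > 1 && j > 1, best = min(best, g(i-1, j-1)); end
            g(i, j) = d(i, j) + best;
        end
    end
    D = g(m, n) / (m + n);                    % length-normalised distance
end
```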

1." Project Description i9 (ar$ware 4or the hardware, it was separated to two main parts which is the interface and the circuit itself. The details about the hardware are as below:-


Interface (The Arduino Uno)

The Arduino Uno is a microcontroller board based on the ATmega328. It has 14 digital input/output pins (of which 6 can be used as PWM outputs), 6 analogue inputs, a 16 MHz ceramic resonator, a USB connection, a power jack, an ICSP header, and a reset button. It contains everything needed to support the microcontroller; simply connect it to a computer with a USB cable, or power it with an AC-to-DC adapter or battery, to get started. The Uno differs from all preceding boards in that it does not use the FTDI USB-to-serial driver chip. Instead, it features the ATmega16U2 (ATmega8U2 up to version R2) programmed as a USB-to-serial converter. Revision 2 of the Uno board has a resistor pulling the 8U2 HWB line to ground, making it easier to put into DFU mode. Revision 3 of the board has the following new features:

1.0 pinout: added SDA and SCL pins near the AREF pin, and two other new pins placed near the RESET pin. One is the IOREF, which allows shields to adapt to the voltage provided by the board; in future, shields will be compatible both with boards that use the AVR, which operates at 5 V, and with the Arduino Due, which operates at 3.3 V. The second is a not-connected pin that is reserved for future purposes.

A stronger RESET circuit.

The ATmega16U2 replaces the ATmega8U2.


"Uno" means one in Italian, and the name was chosen to mark the upcoming release of Arduino 1.0. The Uno and version 1.0 will be the reference versions of Arduino moving forward. The Uno is the latest in a series of USB Arduino boards and the reference model for the Arduino platform.

Circuit

The circuit consists of a resistor, an LED, jumper wires and a breadboard.

ii) Software

The software section is completely based on MATLAB. In our interface we have used MATLAB for voice recognition. It can be used in three different modes, viz. text to speech, speech to text, and voice command recognizer; we have used it in the third mode. In this mode of operation we can add predefined commands.


The software listens for a command and matches it against the given list. If a match occurs, it generates an event corresponding to the match. This event is used to control the device, giving the controller an input that drives the output and thus controls the system. A minimal sketch of this match-and-act loop is given below.
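The report's matching code is not reproduced here, so the following is only a hedged sketch of the idea: compare a recognized word against the predefined command list and, on a match, raise an event, here a byte written to a serial connection to the Arduino. The command names, the COM port and the one-byte protocol are assumptions for illustration.

```matlab
% Sketch of the command-matching loop (assumed design, not the project code).
commands = {'on', 'off', 'play'};          % hypothetical predefined commands
word     = 'on';                           % output of the recognizer

idx = find(strcmpi(word, commands), 1);    % case-insensitive match
if ~isempty(idx)                           % a match generates an event
    s = serial('COM3', 'BaudRate', 9600);  % hypothetical Arduino port
    fopen(s);
    fwrite(s, uint8(idx));                 % event: command index as one byte
    fclose(s); delete(s);
end
```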

1.3 Speech Recognition System Applications

i) Healthcare

In the healthcare domain, speech recognition can be implemented in the front end or the back end of the medical documentation process. Front-end speech recognition is where the provider dictates into a speech-recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document. Back-end or deferred speech recognition is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed along with the original voice file to the editor, where the draft is edited and the report finalized. Deferred speech recognition is widely used in the industry currently.


ii) Military

Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons-release parameters, and controlling flight displays. It was found that recognition deteriorated with increasing g-loads. It was also concluded that adaptation greatly improved the results in all cases, and that introducing models for breathing improved recognition scores significantly. Contrary to what might be expected, no effects of the broken English of the speakers were found. It was evident that spontaneous speech caused problems for the recognizer, as could be expected. A restricted vocabulary and, above all, a proper syntax could thus be expected to improve recognition accuracy substantially.

iii) Training Air Traffic Controllers

Training for air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialogue with the trainee controller that simulates the dialogue the controller would have to conduct with pilots in a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. In theory, air controller tasks are also characterized by highly structured speech as the primary output of the controller; hence, reducing the difficulty of the speech recognition task should be possible. In practice, this is rarely the case. The FAA document 7110.65 details the phrases that should be used by air traffic controllers.


iv) Telephony

The improvement of mobile processor speeds has made speech-enabled Symbian and Windows Mobile smartphones feasible. Speech is used mostly as a part of the user interface, for creating predefined or custom speech commands. Leading software vendors in this field are Microsoft Corporation (Microsoft Voice Command), Digital Syphon (Sonic Extractor), LumenVox, Nuance Communications (Nuance Voice Control), Speech Technology Center, Vito Technology (VITO Voice2Go), Speereo Software (Speereo Voice Translator), Verbyx VRX and SVOX.

1.4 The Top 5 Uses of Speech Recognition Technology

The accuracy and acceptance of speech recognition has come a long way in the last few years, and forward-thinking contact-centre operations are now adopting this technology to enhance their operation and improve their bottom-line profitability.

1. Playing back simple information

If you have customers who need fast access to information: in many circumstances customers do not actually need or want to speak to a live operator. For example, if they have little time or only require basic information, then speech recognition can be used to cut waiting times and provide customers with the information they want. By deploying an intelligent speech recognition system, Dublin Airport was able to cope with a 30 per cent rise in passenger numbers without the need to increase staff levels. Incoming customer calls are filtered according to requirements, and those wanting basic information, say on departures or arrivals, are automatically directed to the speech recognition system, which quickly evaluates the nature of the enquiry through a series of prompts. At all times there is an option to speak with a live operator if necessary. The system has been fine-tuned to pick up the vagaries of the Irish accent. The average


call time has been reduced to just 53 seconds, freeing up skilled agents for more complex calls.

2. Call steering

Putting callers through to the right department. Waiting in a queue to get through to an operator or, worse still, finally being put through to the wrong operator can be very frustrating to your customer, resulting in dissatisfaction. By introducing speech recognition, you can allow callers to choose a self-service route, or alternatively to say what they want and be directed to the correct department or individual. Standard Life is using speech recognition for its Life and Pensions business. The solution helps in three ways: it ascertains what the call is about, if necessary it takes the customer through security checks, and then it transfers the customer to the appropriate member of staff. The details that the customer has already provided appear on the screen, so that they do not have to repeat the information. Using this technology, Standard Life increased its overall call-handling capacity by over 25 per cent and reduced its misdirected calls by 66 per cent. The system also gives a better understanding of why customers are calling, because it allows customers to voice their request rather than forcing them to conform to an organization's preconceptions of what they want.

3. Automated identification

Where you need to authenticate someone's identity on the phone without using risky personal data. Identity fraud is now one of the biggest concerns facing UK organizations, and research by the UK's fraud prevention service (CIFAS) estimates that it is costing the UK £1.7bn a year. Some advanced speech recognition systems provide an answer to this problem using voice biometrics. This technology is now accepted as a major tool in combating telephone-based crime. On average it takes less than two minutes to create a voiceprint based on


specific text such as a name and account number. This is then stored against the individual's record, so when they next call they can simply say their name, and if the voiceprint matches what is stored, the person is put straight through to a customer service representative. This takes less than 30 seconds and also bypasses the need for the individual to run through a series of tedious ID checks such as passwords, address details and so on. Australia's 8th largest insurer, Health Management, is successfully using voice biometrics to allow existing account holders to speak to customer service representatives quickly and securely. The company has enrolled more than 20,000 customers' voiceprints.

4. Removing IVR menus

Replacing complicated and often frustrating push-button IVR. Due to poorly implemented systems, IVR and automated call-handling systems are often unpopular with customers. However, there is a way to improve this scenario, termed intelligent call steering (ICS), that does not involve any button pushing. The system simply asks the customer what they want (in their words, not yours) and then transfers them to the most suitable resource to handle their call. Callers dial one number and are greeted by the message "Welcome to XYZ Company, how can I help you?" The caller is routed to the right agent within 20 to 30 seconds of the call being answered, with misdirected calls reduced to as low as 3-5 per cent. By introducing Natural Language Speech Recognition (NLSR), general insurance company Suncorp replaced its original push-button IVR, enabling the customer to simply say what they wanted. Using a financial services statistical language model of over 100,000 phrases, the system can more accurately assess the nature of the call and transfer it first time to the appropriate department or advisor. The company reduced its call waiting times to around 30 seconds and misdirected calls to virtually nil.

5. Dealing with spikes in call volumes


You need to handle high volumes of customer service enquiries from repeat customers. The betting industry is an example of a business that has very high volumes of calls from regular punters, most of which occur in irregular peaks and troughs. During a normal day, races occur every ten minutes, with 80 per cent of calls occurring in the minutes before each race. To overcome this problem, Ladbrokes was able to divert the calls depending simply on their nature, e.g. placing a bet or asking for odds, which were both handled automatically, while for more complex customised bets callers could speak directly to an operator. The system is effective on all race days, but on big race days such as the Grand National or the Cheltenham Gold Cup it enables the company to increase the capacity of its call centres without the need to add additional staff. Over 40,000 registered horses and 6,000 football players are part of an extensive database that is updated in real time.

CHAPTER 2 SPEECH RECOGNITION ARCHITECTURE

2.1 Speech Feature Extraction


In this project the most important task is to extract features from the speech signal. Speech feature extraction in a classification problem is about reducing the dimensionality of the input vector while maintaining the discriminating power of the signal. As we know from the fundamental structure of speaker identification and verification systems above, the number of training and test vectors needed for the classification problem grows exponentially with the dimension of the given input vector, so we need feature extraction. The extracted features should meet some criteria while dealing with the speech signal:

- The extracted speech features should be easy to measure.
- They should distinguish between speakers while being tolerant of intra-speaker variability.
- They should not be susceptible to mimicry.
- They should show little fluctuation from one speaking environment to another.
- They should be stable over time.
- They should occur frequently and naturally in speech.

In this project we are using the Mel Frequency Cepstral Coefficients (MFCC) technique to extract features from the speech signal and compare the unknown speaker with the existing speakers in the database. The figure below shows the complete pipeline of Mel Frequency Cepstral Coefficient extraction.


"."

raming an$ 6in$owing As shown in the figure below the speech signal is slowly varying over time and it is

called 2uasi stationery.

Above plot shows the word spoken by speaker. The recordings were digiti@ed at f samples is e2ual to **,A&: samples per second and at *= bits per sample. Time goes from left to right and amplitude is shown vertically. 3hen the speech signal is e1amined over a short


period of time, such as 5 to 100 milliseconds, the signal is reasonably stationary, and therefore the signal is examined in short time segments; processing over such segments is referred to as short-time spectral analysis. This means that the signal is blocked into frames of 20-30 milliseconds each, and to avoid the loss of any information due to windowing, adjacent frames overlap each other by 30 to 50 percent. Once the signal has been framed, each frame is multiplied by a window function w(n) of length N. The window function we are using is called the Hamming window,

w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)) for n = 0, 1, ..., N-1

where N = length of the frame.
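As a concrete illustration of the framing step, here is a small sketch; the 25 ms frame length and 50 percent overlap are chosen from the ranges quoted above, and the variable names are ours, not the project's.

```matlab
% Frame a speech signal into overlapping, Hamming-windowed segments.
fs    = 11025;                       % sampling rate used in this project
x     = randn(fs, 1);                % stand-in for 1 s of recorded speech
N     = round(0.025 * fs);           % 25 ms frame length
shift = round(0.5 * N);              % 50 percent overlap
w     = hamming(N);

nfrm   = floor((length(x) - N) / shift) + 1;
frames = zeros(N, nfrm);
for k = 1:nfrm
    frames(:, k) = x((k-1)*shift + (1:N).') .* w;   % frame k, windowed
end
```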

".. (amming 6in$ow 'amming window is also called the raised cosine window. The e2uation and plot for the 'amming window shown below. n a window function there is a @ero valued outside of some chosen interval. 4or e1ample, a function that is stable inside the interval and @ero elsewhere is called a rectangular window that illustrates the shape of its graphical representation. 3hen signal or any other function is multiplied by a window function, the product is also @ero valued outside the interval. The windowing is done to avoid problems due to truncation of the signal. 3indow function has some other applications such as spectral analysis, filter design, and audio data compression such as "orbis.


".0 Ceptrum $epstrum name was derived from the spectrum by reversing the first four letters of spectrum. 3e can say cepstrum is the 4ourier Transformer of the log with unwrapped phase of the 4ourier Transformer. 5athematically we can say $epstrum of signal O 4T8log84T8the signal99 P%&Qm9 3here m is the integer re2uired to properly unwrap the angle or imaginary part of the comple1 log function. Algorithmically we can say R Signal - 4T - log - phase unwrapping - 4T - $epstrum. 4or defining the real values real cepstrum uses the logarithm function. 3hile for defining the comple1 values the comple1 cepstrum uses the comple1 logarithm function. The


real cepstrum uses only the information in the magnitude of the spectrum, whereas the complex cepstrum holds information about both the magnitude and the phase of the initial spectrum, which allows reconstruction of the signal. The cepstrum can be calculated in many ways; some of them need a phase-unwrapping algorithm, others do not. The figure below shows the pipeline from signal to cepstrum.

As discussed in the Framing and Windowing section, the speech signal is composed of a quickly varying part e(n), the excitation sequence, convolved with a slowly varying part θ(n), the vocal-system impulse response:

s(n) = e(n) * θ(n)

Once the quickly varying part and the slowly varying part have been convolved, it is difficult to separate the two; the cepstrum is introduced to separate these two parts. The derivation of the cepstrum is given below:


Let DTFT{.} denote the discrete-time Fourier transform and IDTFT{.} the inverse discrete-time Fourier transform. By moving the signal from the time domain to the frequency domain, the convolution becomes a multiplication:

S(ω) = DTFT{e(n) * θ(n)} = E(ω) Θ(ω)

The multiplication becomes an addition by taking the logarithm of the spectral magnitude:

log|S(ω)| = log|E(ω)| + log|Θ(ω)|

The inverse Fourier transform works individually on the two components, as it is a linear operation:

c_s(n) = IDTFT{log|S(ω)|} = IDTFT{log|E(ω)|} + IDTFT{log|Θ(ω)|}

The domain of the signal c_s(n) is called the quefrency domain.

2.5 Mel Frequency Cepstral Coefficients (MFCC)

In this project we are using Mel Frequency Cepstral Coefficients. MFCCs are coefficients that represent audio based on human perception, and they have had great success in speaker recognition applications. They are derived from the Fourier transform of the audio clip. In this technique the frequency bands are positioned logarithmically, whereas in the plain Fourier transform the frequency bands are spaced linearly. Because the frequency bands are positioned logarithmically, MFCC approximates the human auditory response more closely than other representations, and these coefficients allow better processing of the data.


In Mel Frequency Cepstral Coefficients, the calculation of the mel cepstrum is the same as for the real cepstrum, except that the mel cepstrum's frequency scale is warped to correspond to the mel scale. The mel scale was proposed by Stevens, Volkmann and Newman in 1937. It is based mainly on the study of the pitch or frequency perceived by humans, and the scale is divided into units called mels. In the underlying test, the listener started out hearing a frequency of 1000 Hz and labelled it 1000 mel for reference. The listeners were then asked to change the frequency until it reached twice the perceived pitch of the reference, and this frequency was labelled 2000 mel. The same procedure was repeated for half the perceived pitch, which was labelled 500 mel, and so on. On this basis the normal frequency scale is mapped onto the mel scale. The mel scale is approximately a linear mapping below 1000 Hz and logarithmically spaced above 1000 Hz. The figure below shows an example of normal frequency mapped onto mel frequency. A common form of the mapping and its inverse is:

mel(f) = 2595 log10(1 + f/700)   ... (1)

f = 700 (10^(mel/2595) - 1)   ... (2)


Equation (1) above shows the mapping of normal frequency into mel frequency, and equation (2) is the inverse, used to get back the normal frequency.

The figure above shows the calculation of the mel cepstral coefficients. Here we use a filter bank to perform the mel frequency warping: a filter bank with filters centred according to the mel scale is a convenient way to do the warping. The widths of the triangular filters vary according to the mel scale, so that the log total energy in a critical band around each centre frequency is included. The result after warping is a set of coefficients, one per filter channel.


Finally, we use the inverse discrete Fourier transform for the cepstral coefficient calculation. In this step the log mel spectral coefficients are transformed back to the quefrency domain, where N is the length of the DFT we used in the cepstrum section.
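Putting Sections 2.2 to 2.5 together, a compact sketch of the whole MFCC computation might look as follows. This is a simplified illustration under our own assumptions: melmat stands in for the output of melfiltermatrix.m, and a DCT is used for the final inverse-transform step, a common practical choice.

```matlab
% Simplified MFCC pipeline: frame -> window -> |FFT|^2 -> mel filter bank
% -> log -> inverse transform (DCT). Illustrative only, not the project code.
function c = mfcc_sketch(x, fs, melmat, ncep)
    N     = round(0.025 * fs);             % 25 ms frames
    shift = round(0.010 * fs);             % 10 ms frame shift
    nfft  = 2^nextpow2(N);                 % melmat must be nchan x (nfft/2+1)
    w     = hamming(N);
    nfrm  = floor((length(x) - N) / shift) + 1;
    c     = zeros(ncep, nfrm);
    for k = 1:nfrm
        seg = x((k-1)*shift + (1:N).') .* w;    % framing and windowing
        P   = abs(fft(seg, nfft)).^2;           % power spectrum
        E   = melmat * P(1:nfft/2 + 1);         % mel filter-bank energies
        cc  = dct(log(E + eps));                % back to the quefrency domain
        c(:, k) = cc(1:ncep);                   % keep the first ncep coefficients
    end
end
```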

".7 4ector 8uanti9ation A speaker recognition system must able to estimate probability distributions of the computed feature vectors. Storing every single vector that generate from the training mode is impossible, since these distributions are defined over a high-dimensional space. t is often easier to start by 2uanti@ing each feature vector to one of a relatively small number of template vectors, with a process called vector 2uanti@ation. "S is a process of taking a large set of feature vectors and producing a smaller set of measure vectors that represents the centroids of the distribution.

Fig 3.1: The vectors generated from training, before VQ


The techni2ue of "S consists of e1tracting a small number of representative feature vectors as an efficient means of characteri@ing the speaker specific features. /y means of "S, storing every single vector that we generate from the training is impossible.

Fig 3.2: The representative feature vectors resulting after VQ

Using these training data, features are clustered to form a codebook for each speaker. In the recognition stage, the data from the tested speaker is compared to the codebook of each speaker and the difference is measured. These differences are then used to make the recognition decision.

2.7 K-Means Algorithm

The K-means algorithm is a way to cluster the training vectors into representative feature vectors. The algorithm clusters the vectors, based on their attributes, into k partitions, using the k means of the data to define the clusters. The objective of k-means is to minimize the total intra-cluster variance

V = Σ_{i=1}^{k} Σ_{x_j ∈ S_i} ||x_j − μ_i||^2

where there are k clusters S_i, i = 1, 2, ..., k, and μ_i is the centroid or mean point of all the points x_j ∈ S_i.


Fig 3.3: Clusters in the k-means algorithm

The k-means algorithm uses a least-squares partitioning method to divide the input vectors into k initial sets. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid, after which the centroids are recalculated for the new clusters. The algorithm is repeated until the vectors no longer switch clusters or, alternatively, the centroids no longer change. A compact sketch of this loop, used to build a VQ codebook, is given below.
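This is a minimal k-means codebook trainer written for illustration (our own sketch; the project may equally have used MATLAB's built-in kmeans function):

```matlab
% Train a VQ codebook with plain k-means. X is d x n (columns are MFCC
% vectors); k is the codebook size. Returns the d x k codebook C.
function C = train_codebook_sketch(X, k)
    n = size(X, 2);
    C = X(:, randperm(n, k));                 % initial centroids
    prev = zeros(1, n);
    while true
        % assign each vector to its nearest centroid (Euclidean)
        d2 = zeros(k, n);
        for i = 1:k
            d2(i, :) = sum(bsxfun(@minus, X, C(:, i)).^2, 1);
        end
        [~, idx] = min(d2, [], 1);
        if isequal(idx, prev), break; end     % no vector switched cluster
        prev = idx;
        for i = 1:k                           % recompute centroids
            if any(idx == i)
                C(:, i) = mean(X(:, idx == i), 2);
            end
        end
    end
end
```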


2.8 Distance Measure

In the speaker recognition phase, an unknown speaker's voice is represented by a sequence of feature vectors {x_1, x_2, ..., x_i}, which is then compared with the codebooks from the database. The unknown speaker is identified by measuring the distortion distance between the two vector sets, based on minimizing the Euclidean distance. The Euclidean distance is the "ordinary" distance between two points, the one that would be measured with a ruler, and it can be derived by repeated application of the Pythagorean theorem. The Euclidean distance between two points P = (p_1, p_2, ..., p_n) and Q = (q_1, q_2, ..., q_n) is defined as:

d(P, Q) = sqrt((p_1 − q_1)^2 + (p_2 − q_2)^2 + ... + (p_n − q_n)^2)

The speaker whose codebook gives the lowest distortion distance is chosen as the identity of the unknown person.
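Combined with the codebook trainer above, the identification step can be sketched as follows (again an illustrative assumption, not the project's actual matching code):

```matlab
% Identify a speaker: average the distance from each test vector to its
% nearest codeword for every codebook, then pick the lowest distortion.
% X is a d x n matrix of test MFCCs; codebooks is a cell array of d x k
% codebook matrices, one per registered speaker.
function best = identify_sketch(X, codebooks)
    nspk = numel(codebooks);
    dist = zeros(1, nspk);
    for s = 1:nspk
        C    = codebooks{s};
        dmin = inf(1, size(X, 2));
        for i = 1:size(C, 2)
            di   = sqrt(sum(bsxfun(@minus, X, C(:, i)).^2, 1));
            dmin = min(dmin, di);             % nearest codeword so far
        end
        dist(s) = mean(dmin);                 % distortion for speaker s
    end
    [~, best] = min(dist);                    % lowest distortion wins
end
```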


CHAPTER 3 RESULTS

3.1 Graphical User Interface (GUI)

1. To get started, type "guide" in MATLAB. Let's start with a blank GUI.


&. Eet a blank form that we can place controls on. /efore %ump in, 5ake sure we have a plan how to lay out our E. . f you press the green arrow at the top of the E. editor, it will save your current version and run the program. The first time we run it, it will ask to name the program. ,ur figure looks about right, but it doesnWt do anything yet. 3e have to define a callback for the button so it will plot the function when we press it. 3riting $allbacks, when we run the program, it creates two files .gui.fig, contains the layout of your controls gui.m, contains code that defines a callback function for each of our controls. 3e generally donWt mess with the initiali@ation code in the mfile. 3e will probably leave many of the control callbacks blank. n our e1ample, we %ust need to locate the function for the button. This is why it is important to have a good tag so we can keep our controls straight. 3e can also right-click on the control and select "iew $allback. )unning our program when we modify our m-file code, we donWt have to re-run our E. . The new code will run as a callback, so we can 2uickly test changes. To run our E. , in the workspace we %ust type in the name.


3. Above is how the GUI looks, ready for use. The RECORD tab records the voice using a microphone. The PLAY tab plays back our voice after filtering out the noise. The SAVE and LOAD buttons allow us to save the voice and load it back from the MATLAB folder in My Documents. ZOOM IN and OUT allow us to zoom into the spectrum to do any sort of analysis we require. At the top left corner there is a slider that sets the recording time; the range is from 1 to 6 seconds.


.." (ar$ware

Above is the Arduino .+, board which is the interface between the laptop and the circuit. The green light on the Arduino .+, board shows that it is ready and connected to 5AT6A/ software. f its red light it means it is still being detected by the software.

Above is the 6(- circuit to confirm if the speech recognition is working or not. The crystal clear 6(- represents if the voice captured is same as the voice being played after


removing the noise. The red LED lights while the voice is being recorded and while the system is preprocessing the voice to remove the noise.

Above, the system is currently recording the individual's voice and preprocessing it to remove the noise.

Above, the voice played out is shown to be the same as the voice recorded, minus the noise.


Above shows the voice spectrum after being analysed. As can be seen, there are two syllables being pronounced: the trial word used was "ZERO", which was separated into "ZE" and "RO". There is silence at the start and the end of the syllables.


BEFORE WINDOWING

AFTER WINDOWING


As can be seen, before windowing the voice looks rough and large in the spectrum view. After windowing, however, the voice becomes smoother and more concentrated, as the unrelated noise has been removed.

CHAPTER 4 CONCLUSION

4.1 Hardware

The goal of this project was to create a speaker recognition system and apply it to the speech of an unknown speaker, by investigating the extracted features of the unknown speech and then playing it back to check whether the reproduced voice was correct. The feature extraction is done using MFCC (Mel Frequency Cepstral Coefficients); the function melcepst is used to calculate the mel cepstrum of a signal. The speaker was modelled using Vector Quantization (VQ): a VQ codebook is generated by clustering the training feature vectors of each speaker and is then stored in the speaker database, and in this method the K-means algorithm is used to do the clustering. In the recognition stage, a distortion measure based on minimizing the Euclidean distance was used when matching an unknown speaker against the speaker database. During this project we found that the VQ-based clustering approach provides a fast speaker identification process. Before accepting this project we knew nothing about neural networks and what they can or cannot do, but with Internet technology we could find a lot of information about them. Speech recognition is a very popular problem that researchers have been investigating intensively for the last two decades.


