
Abstract:

The main objective of this paper is to show how a new voice recognition approach can be used to recognize a person by identifying their voice. The tool used in this approach is a mathematical model built mainly in MATLAB with the help of SFS (Speech Filing System). The project entails the design of speaker recognition code using MATLAB. Signal processing in the time and frequency domains yields a powerful method for analysis, and MATLAB's built-in functions for frequency-domain analysis, together with its straightforward programming interface, make it an ideal tool for speech analysis projects. For the current project, experience was gained in general MATLAB programming and in the manipulation of time-domain and frequency-domain signals. Speech editing was performed, as was degradation of signals by the addition of Gaussian noise. Background noise can be successfully removed from a signal by applying a 3rd-order Butterworth filter.

INTRODUCTION

Development of speaker identification systems began as early as the 1960s with exploration into voiceprint analysis, where characteristics of an individual's voice were thought to characterize the uniqueness of an individual much like a fingerprint. The early systems had many flaws, and research ensued to derive a more reliable method of predicting the correlation between two sets of speech utterances. Speaker identification research continues today within the field of digital signal processing, where many advances have taken place in recent years.

Speaker recognition, i.e., the method of automatically identifying who is speaking on the basis of individual information contained in speech waves, is basically divided into two classes: speaker identification and speaker verification. Speaker recognition technology has great potential to create new services that will make our everyday lives more secure. It has various applications in access control to restricted services, such as banking, database services, shopping, voice mail, and access to secure equipment or areas, where real-time processing with a high security level and as little burden on users as possible is usually required.

Focusing on the application of access control to restricted areas, this work limits the number of utterances a user must provide during the enrolment (registration) process to three. During enrolment, speakers are asked to utter the same word or sentence three times, in as consistent a way as possible; this is considered their personal identification voice (PIV). The PIV includes characteristics of both the word or sentence being spoken and the speaker's voice. The system only accepts registered users with their registered PIV.

Two fundamental issues regarding a text-dependent speaker recognition system are feature extraction and matching. The former involves finding features which can distinguish the personal identification voice (PIV) of one person from another. The latter is the process of recognizing users automatically using those features; for an access control system, this involves identifying users and authenticating their identity. Furthermore, for a system that places few burdens on users or uses little enrolment data, capturing the variation of a person's features over time is also a critical issue.

There are two popular approaches to feature extraction. The first is to use traditional acoustic features such as formant frequencies, pitch, and the energy of the registered voice, which represent physical characteristics of the speaker and the utterance. This approach has been used successfully in several difficult tasks. The second approach is to use a spectral representation of the speech signal, such as linear prediction coding [13] or mel-frequency cepstrum coefficients, which is suitable for matching based on a statistical model. The feature extraction process in the second approach, however, has a large computational cost due to the repetition of the process over a large number of small segments of the speech, and thus it is still not well suited to real-time applications.

Background

Voice: Voice is easy to capture, and a voice print is an acceptable biometric in almost all societies. Voice may be the only feasible biometric in applications requiring person recognition over a telephone. Voice is not expected to be sufficiently distinctive to permit identification of an individual from a large database. Moreover, a voice signal is typically degraded in quality by the microphone, communication channel, and digitizer characteristics. Voice is also affected by a person's health (e.g., a cold), stress, emotions, and so on. Besides, some people seem to be extraordinarily skilled at mimicking others. However, with recent advancements in technology and voice printing, some properties have been characterised which remain independent of these factors.

Voice Biometrics
Voice biometrics, meaning speaker recognition, identification and verification technologies, should never be confused with speech recognition technologies. Speech recognition technologies are capable of recognizing what a person is saying without recognizing who the person is; applications of speech recognition for security purposes or secure transactions are therefore limited. In contrast, speaker recognition, verification and identification technologies can be used to ascertain whether the speaker is the person he or she claims to be. According to leading voice-based biometrics analyst J. Markowitz: Speaker identification is "the process of finding and attaching a speaker identity to the voice of an unknown speaker. Automated speaker identification does this by comparing the voice with stored samples in a database of voice models." Speaker verification is "the process of determining whether a person is who she/he claims to be. It entails a one-to-one comparison between a newly input voiceprint (by the claimant) and the voiceprint for the claimed identity that is stored in the system." Our system should involve both speaker identification and speaker verification.

APPROACH:

This multi-faceted design project can be categorized into different sections: speech editing, speech degradation, speech enhancement, pitch analysis, formant analysis and waveform comparison. The resulting discussion will be segmented based on these delineations. There are two major stages within voice recognition: a training stage and a testing stage. Training involves teaching the system by building its dictionary, an acoustic model for each voice that the system needs to recognize. In the testing stage we use the acoustic models of the voices to recognize them using a classification algorithm. The development workflow consists of three steps:
- Voice acquisition
- Voice analysis
- User interface development

Software Requirements: A complete voice recognition system includes data preparation tools from outside sources. First, we need software to record and save voice samples; here we choose Microsoft Sound Recorder V5.1, and the wave (.wav) file is the format we use to save voice samples. The core development environment is MATLAB. MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. This allows us to solve many technical computing problems.

ACQUIRING VOICE: For training, speech is acquired from a microphone and brought into the development environment for offline analysis. For testing, speech is continuously streamed into the environment for online processing. During the training stage, it is necessary to record repeated voice samples for the dictionary. For example, we repeat the word "one" many times with a pause between each utterance. Using the following MATLAB code with a standard PC sound card, we capture ten seconds of speech from a microphone input at 8000 samples per second:

Fs = 8000;        % Sampling frequency (Hz)
Duration = 10;    % Duration (sec)
y = wavrecord(Duration*Fs, Fs);


This approach works well for training data. In the testing stage, however, we need to continuously acquire and buffer speech samples, and at the same time, process the incoming speech frame by frame, or in continuous groups of samples. We use Data Acquisition Toolbox to set up continuous acquisition of the speech signal and simultaneously extract frames of data for processing.

Analyzing the Acquired Speech: We begin by developing a voice-detection algorithm that separates each sample from ambient noise. We then derive an acoustic model that gives a robust representation of each sample at the training stage. Finally, we select an appropriate classification algorithm for the testing stage.

Developing a Voice Recognition Algorithm:-

The speech-detection algorithm is developed by processing the prerecorded speech frame by frame within a simple loop. The prerecorded voice is then analysed and features are extracted from it. Feature extraction reduces the voice to its basic properties. The feature extraction methods used are described below.

We can also use pitch analysis for a given voice signal. For this project we will consider pitch analysis, MFCC and power spectrum distribution to extract features from a voice signal, and then use the Gaussian mixture model (GMM) approach to build a statistical model.

Pitch analysis of a voice signal: Pitch is the representation of a sound wave's frequency (the number of cycles per second). The file recorded with slower speech was found from the ordered list of speakers. Pitch analysis was conducted and the relevant parameters (low and high pitch) were extracted. The average pitch of the entire .wav file was computed and plotted. The results of pitch analysis can be used in voice recognition, where differences in average pitch can be used to characterize a voice file. Pitch defines two parameters:
 Low pitch
 High pitch

The reference frequency here ranges from 440 to 880 Hz, depending on the octave. Low pitch is an effect in which the frequency of the wave falls to a certain value, while high pitch is an effect in which the frequency of the wave rises to a certain value. For example, if two .wav files have frequencies A = 440 Hz and B = 547 Hz respectively, then A belongs to the low-pitch standard and B to the high-pitch standard.
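As a rough illustration of this classification, the following MATLAB sketch labels the two example recordings as low or high pitch from their average fundamental frequencies. The 490 Hz threshold is an assumed midpoint between the two example values above, not a value taken from the paper.

avgF0_A = 440;      % average pitch of file A (Hz)
avgF0_B = 547;      % average pitch of file B (Hz)
threshold = 490;    % assumed decision boundary (Hz), midway between the examples
if avgF0_A < threshold
    disp('File A: low pitch');
else
    disp('File A: high pitch');
end
if avgF0_B < threshold
    disp('File B: low pitch');
else
    disp('File B: high pitch');
end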

MFCC analysis for voice recognition: Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up a mel-frequency cepstrum (MFC). They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound, for example in audio compression.

MFCCs are commonly derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
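As a rough illustration only, the following MATLAB sketch walks through these five steps for a single speech frame. The frame length, FFT size, number of filters and the mel mapping mel = 2595*log10(1 + f/700) are common textbook choices, not values specified in this paper; in practice a library routine such as melcepst (used later in the code appendix) would normally be preferred.

Fs    = 8000;                         % sampling frequency (Hz)
frame = randn(256,1);                 % one windowed speech frame (placeholder data)
nfft  = 512;                          % FFT length
nfilt = 20;                           % number of triangular mel filters
ncep  = 13;                           % number of cepstral coefficients kept

% 1. Fourier transform of the windowed frame -> power spectrum
pspec = abs(fft(frame .* hamming(length(frame)), nfft)).^2;
pspec = pspec(1:nfft/2+1);

% 2. Map the power spectrum onto the mel scale with triangular windows
fmax   = Fs/2;
melmax = 2595*log10(1 + fmax/700);
melpts = linspace(0, melmax, nfilt+2);               % filter edge points (mel)
hzpts  = 700*(10.^(melpts/2595) - 1);                % back to Hz
binpts = floor((nfft+1)*hzpts/Fs) + 1;               % FFT bin indices
fbank  = zeros(nfilt, nfft/2+1);
for m = 1:nfilt
    for k = binpts(m):binpts(m+1)-1                  % rising edge of triangle m
        fbank(m,k) = (k - binpts(m)) / (binpts(m+1) - binpts(m));
    end
    for k = binpts(m+1):binpts(m+2)-1                % falling edge of triangle m
        fbank(m,k) = (binpts(m+2) - k) / (binpts(m+2) - binpts(m+1));
    end
end
melenergies = fbank * pspec;

% 3. Log of the energy in each mel band
logmel = log(melenergies + eps);

% 4. Discrete cosine transform of the log mel energies
cep = dct(logmel);

% 5. Keep the first few amplitudes as the MFCC feature vector
mfccvec = cep(1:ncep);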

POWER SPECTRAL DENSITY: The spectral density, power spectral density (PSD), or energy spectral density (ESD), is a positive real function of a frequency variable associated with a stationary stochastic process, or a deterministic function of time, which has dimensions of power per Hz, or energy per Hz. It is often called simply the spectrum of the signal. Intuitively, the spectral density captures the frequency content of a stochastic process and helps identify periodicities.
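As an illustrative sketch (not code from the paper), the PSD of a recorded voice sample can be estimated in MATLAB with Welch's method; the file name, window length and overlap below are placeholder choices.

[speech, Fs] = wavread('sample.wav');    % 'sample.wav' is a placeholder file name
speech = speech(:,1);                    % use the first channel if the file is stereo
[Pxx, f] = pwelch(speech, hamming(256), 128, 512, Fs);   % Welch PSD estimate
plot(f, 10*log10(Pxx));
xlabel('Frequency (Hz)');
ylabel('Power/frequency (dB/Hz)');
title('Welch PSD estimate of the voice sample');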

For example, PSDs for three different sound waves having different properties can be computed and compared in this way.

VOICE RECOGNITION: To detect voice, we use a combination of signal energy and zero-crossing counts for each speech frame. Signal energy works well for detecting voiced segments, while zero-crossing counts work well for detecting unvoiced segments. Calculating these metrics is simple using core MATLAB mathematical and logical operators. To avoid identifying ambient noise as speech, we assume that each isolated word will last at least 25 milliseconds.
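The following is a minimal sketch of that per-frame decision, assuming a speech vector named speech sampled at 8000 Hz (as acquired above); the two thresholds are illustrative values, not calibrated ones.

framesize    = 200;                       % 25 ms frames at Fs = 8000 Hz
energyThresh = 0.01;                      % assumed energy threshold
zcThresh     = 50;                        % assumed zero-crossing threshold
nframes  = floor(length(speech)/framesize);
isSpeech = false(1, nframes);
for i = 1:nframes
    frame  = speech((i-1)*framesize + 1 : i*framesize);
    energy = sum(frame.^2);                        % short-time energy
    zc     = sum(abs(diff(sign(frame)))) / 2;      % zero-crossing count
    % Treat the frame as speech if it is either clearly voiced (high energy)
    % or clearly unvoiced speech (many zero crossings).
    isSpeech(i) = (energy > energyThresh) || (zc > zcThresh);
end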

Developing the Acoustic Model: A good acoustic model should be derived from speech characteristics that will enable the system to distinguish between the different words in the dictionary. We know that different sounds are produced by varying the shape of the human vocal tract and that these different sounds can have different frequencies. To investigate these frequency characteristics we examine the power spectral density (PSD) estimates of various spoken digits. Since the human vocal tract can be modeled as an all-pole filter, we use the Yule-Walker parametric spectral estimation technique from Signal Processing Toolbox to calculate these PSDs. After importing an utterance of a single digit into the variable speech, we use the following MATLAB code to visualize the PSD estimate:

order = 12;
nfft = 512;
Fs = 8000;
pyulear(speech, order, nfft, Fs);

Since the Yule-Walker algorithm fits an autoregressive linear prediction filter model to the signal, we must specify the order of this filter. We select a value of 12, which is typical in speech applications. From the linear predictive filter coefficients, we can obtain several feature vectors using Signal Processing Toolbox functions, including reflection coefficients, log area ratio parameters, and line spectral frequencies.

One set of spectral features commonly used in speech applications because of its robustness is Mel Frequency Cepstral Coefficients (MFCCs). MFCCs give a measure of the energy within overlapping frequency bins of a spectrum with a warped (mel) frequency scale. Since speech can be considered short-term stationary, MFCC feature vectors are calculated for each frame of detected speech. Using many utterances of a digit and combining all the feature vectors, we can estimate a multidimensional probability density function (PDF) of the vectors for a specific digit. Repeating this process for each digit, we obtain an acoustic model for each digit. During the testing stage, we extract the MFCC vectors from the test speech and use a probabilistic measure to determine the source digit with maximum likelihood. The challenge then becomes selecting an appropriate PDF to represent the MFCC feature vector distributions.
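As a small illustrative sketch (not part of the original code), the alternative LPC-derived features mentioned above could be obtained for one detected speech frame roughly as follows, using standard Signal Processing Toolbox functions; the variable frame is assumed to hold one frame of speech.

order = 12;                       % LPC order, as used for pyulear above
a   = lpc(frame, order);          % linear prediction coefficients
rc  = poly2rc(a);                 % reflection coefficients
lar = log((1 + rc) ./ (1 - rc));  % log area ratio parameters
lsf = poly2lsf(a);                % line spectral frequencies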


Figure 4a

This is the Yule-Walker representation of a word spoken by the same person three different times. Figure 4a shows the distribution of the first dimension of the MFCC feature vectors extracted from the training data for the digit "one". We could use dfittool in Statistics Toolbox to fit a PDF, but the distribution looks quite arbitrary, and standard distributions do not provide a good fit. One solution is to fit a Gaussian mixture model (GMM), a sum of weighted Gaussians. The complete Gaussian mixture density is parameterized by the mixture weights, mean vectors, and covariance matrices of all component densities. For isolated digit recognition, each digit is represented by the parameters of its GMM.

To estimate the parameters of a GMM for a set of MFCC feature vectors extracted from training speech, we use an iterative expectation-maximization (EM) algorithm to obtain a maximum likelihood (ML) estimate. Given some MFCC training data in the variable MFCCtraindata, we use the Statistics Toolbox gmdistribution function to estimate the GMM parameters. This function is all that is required to perform the iterative EM calculations:

% Number of Gaussian component densities
M = 8;
model = gmdistribution.fit(MFCCtraindata, M);

Selecting a Classification Algorithm: After estimating a GMM for each digit, we have a dictionary for use at the testing stage. Given some test speech, we again extract the MFCC feature vectors from each frame of the detected word. The objective is to find the voice model with the maximum a posteriori probability for the set of test feature vectors, which, given a voice model gmmmodel and some test feature vectors, reduces to maximizing a log-likelihood value (a sketch of this step is given after the GUI discussion below).

Developing a GUI: A graphical user interface (GUI) is a graphical display that contains devices, or components, that enable a user to perform interactive tasks. To perform these tasks, the user of the GUI does not have to create a script or type commands at the command line. The GUI components can be menus, toolbars, push buttons, radio buttons, list boxes, and sliders, to name a few. In MATLAB, a GUI can also display data in tabular form or as plots, and can group related components. The GUI for the voice recognition system can be made as shown in the following images.

The above GUI shows us the main menu for the voice recognition system. The functions of the buttons shown above can be described as follows:


 TRAIN - trains the system to recognize a voice sample of a given user by building a dictionary in the database.
 TEST - compares a new voice sample to the samples present in the database to verify whether it belongs to the same person.
 CLEAR DATABASE - clears all data from the dictionary so that new voice samples can be fed in.
 EXIT - exits the GUI.

With the help of the above GUIs we see that a robust system for voice recognition can be developed, helping us to identify a given person on the basis of his or her voice.
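Returning to the classification step referenced earlier, the following is a minimal sketch (under assumed variable names, not code from the paper) of choosing the most likely model: it assumes one fitted gmdistribution object per digit stored in a cell array models, and a matrix MFCCtestdata of test feature vectors with one row per frame.

nDigits = numel(models);
logLik  = zeros(1, nDigits);
for d = 1:nDigits
    % Total log-likelihood of all test frames under the model for digit d
    logLik(d) = sum(log(pdf(models{d}, MFCCtestdata)));
end
[~, recognizedDigit] = max(logLik);   % the model with maximum likelihood wins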


APPLICATIONS:
 Banking
 Security systems
 Database services
 Shopping and voice mail
 Access to secure equipment or areas that require real-time processing with a high security level and as little burden on users as possible
 Developing PIV cards for identity purposes
 Developing voice-controlled home equipment


Advantages:-

 Reduces human effort in various applications
 Provides a better solution for security systems
 Economical
 Less maintenance
 Universally accepted
 Personal identification is secured
 Banking and other online facilities can be made more resistant to hacking

DISADVANTAGES: Although voice recognition solves a lot of problems, it depends on the voice of an individual, which can vary due to factors such as a cold or stress. Mimicry is also a significant weakness of voice recognition systems, and in very loud places voice recognition can have many flaws.


CONCLUSION: The goal of this project was to create a speaker recognition system and apply it to the speech of an unknown speaker, by investigating the extracted features of the unknown speech and comparing them to the stored extracted features of each known speaker in order to identify the unknown speaker. Feature extraction is done using MFCCs (Mel Frequency Cepstral Coefficients). A GUI can be created to provide an interface between the machine and the user. Although this project provides a good basis for voice recognition, advancements in the field of voice biometrics will provide much better and more accurate results.


References :-

 www.mathworks.com
 Dr. Rajesh Hegde, Assistant Professor, Electrical Engineering Department, IIT Kanpur
 Wikipedia, "Mel frequency cepstral coefficient", available at: http://en.wikipedia.org/wiki/Mel_frequency_cepstral_coefficient


MATLAB CODE

 MATLAB code for acquiring data


% Define system parameters
framesize = 80;          % Frame size (samples)
Fs = 8000;               % Sampling frequency (Hz)
RUNNING = 1;             % A flag to continue data capture

% Set up data acquisition from the sound card
ai = analoginput('winsound');
addchannel(ai, 1);

% Configure the analog input object
set(ai, 'SampleRate', Fs);
set(ai, 'SamplesPerTrigger', framesize);
set(ai, 'TriggerRepeat', inf);
set(ai, 'TriggerType', 'immediate');

% Start acquisition
start(ai)

% Keep acquiring data
while RUNNING ~= 0
    % Acquire new input samples
    newdata = getdata(ai, ai.SamplesPerTrigger);
    % Do some processing on newdata
    % <DO_SOMETHING>
    % Set RUNNING to zero when we are done, e.g.:
    % if <WE_ARE_DONE>
    %     RUNNING = 0;
    % end
end

% Stop acquisition
stop(ai);

% Disconnect/clean up
delete(ai);
clear ai;

 Matlab code for reading the voice data frame by frame


% Define system parameters
seglength = 160;                          % Length of frames
overlap = seglength/2;                    % Number of samples to overlap
stepsize = seglength - overlap;           % Frame step size
nframes = floor(length(speech)/stepsize) - 1;

% Initialize variables
samp1 = 1;                                % Frame start
samp2 = seglength;                        % Frame end

for i = 1:nframes
    % Get current frame for analysis
    frame = speech(samp1:samp2);
    % Do some analysis
    % <DO_SOMETHING>
    % Step up to next frame of speech
    samp1 = samp1 + stepsize;
    samp2 = samp2 + stepsize;
end

 Matlab code for pitch analysis


% Code for pitch analysis of a wav file. This code needs the pitch.m
% and pitchacorr.m files to be in the same directory. A plot of the pitch
% contour versus time frame is created and the average pitch of the wav
% file is returned.
% Author = E. Darren Ellis 05/01

[y, fs, nbits] = wavread('*.wav');   % read in the wav file
[t, f0, avgF0] = pitch(y, fs)        % call the pitch.m routine
plot(t, f0)                          % plot pitch contour versus time frame
avgF0                                % display the average pitch
sound(y)                             % play back the sound file

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Function:
%   Extract pitch information from speech files.
%   Pitch can be obtained from the peak of the autocorrelation;
%   usually the original speech file is segmented into frames
%   and the pitch contour is derived from the peaks of the frames.
%
% Input:
%   y:  original speech
%   fs: sampling rate
%
% Output:
%   t:     time frame
%   f0:    pitch contour
%   avgF0: average pitch frequency
%
% Acknowledgement:
%   this code is based on Philipos C. Loizou's colea
%
function [t, f0, avgF0] = pitch(y, fs)

% get the number of samples
ns = length(y);

% error checking on the signal level, remove the DC bias
mu = mean(y);
y = y - mu;

% segment length and update rate: as written, these give 120 ms segments
% chosen every 110 ms, i.e. a 10 ms overlap between segments
fRate = floor(120*fs/1000);
updRate = floor(110*fs/1000);
nFrames = floor(ns/updRate) - 1;

% the pitch contour is then a 1 x nFrames vector
f0 = zeros(1, nFrames);
f01 = zeros(1, nFrames);

% get the pitch from each segmented frame
k = 1;
avgF0 = 0;
m = 1;
for i = 1:nFrames
    xseg = y(k:k+fRate-1);
    f01(i) = pitchacorr(fRate, fs, xseg);
    % do some median filtering, less affected by noise
    if i > 2 && nFrames > 3
        z = f01(i-2:i);
        md = median(z);
        f0(i-2) = md;
        if md > 0
            avgF0 = avgF0 + md;
            m = m + 1;
        end
    elseif nFrames <= 3
        % the original listing used an undefined variable here; f01(i) is assumed
        f0(i) = f01(i);
        avgF0 = avgF0 + f01(i);
        m = m + 1;
    end
    k = k + updRate;
end

t = 1:nFrames;
t = 20 * t;

if m == 1
    avgF0 = 0;
else
    avgF0 = avgF0/(m-1);
end

 Matlab code for MFCC


function [cepstral] = mfcc(x, y, fs)
% Calculate MFCCs at sampling frequency fs and store them in the cell
% array cepstral. Display y(i,:) as each element x{i} is processed.
cepstral = cell(size(x,1), 1);
for i = 1:size(x,1)
    disp(y(i,:))
    cepstral{i} = melcepst(x{i}, fs, 'x');   % melcepst is from the VOICEBOX toolbox
end

