
ARBA MINCH UNIVERSITY

INSTITUTE OF TECHNOLOGY
INFORMATION TECHNOLOGY DEPARTMENT
Project Title: Develop a Speaker- and Text-Dependent Isolated Speech
Recognizer System

Name of scholars        ID

Tewodros Demise         PRAMIT/2022/10
Tabor wegi              PRAMIT/2000/10
Amanuel Debena          PRAMIT/1868/10
Meseret Humine          PRAMIT/1968/10

Submitted to: Dr.-Ing. Abiot Sinamo

Date: February 2019

Arba Minch, Ethiopia



1. Introduction

Speech recognition is an interdisciplinary subfield of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition, computer speech recognition, or speech-to-text. Speech recognition applications are becoming more and more useful nowadays, and various interactive speech-aware applications are available on the market. However, they are usually meant for, and executed on, traditional general-purpose computers. Speech recognition emerges as an efficient input alternative for devices where typing is difficult because of small-screen limitations.

Speech recognition is the process in which certain words of a particular speaker are automatically recognized based on the information contained in individual speech waves. According to the Macmillan Dictionary, speech recognition is “a system where you speak to a computer to make it do things, for example instead of using a keyboard”. While the definition is true, as the field of artificial intelligence moves forward the applications for speech recognition have rocketed. To be able to communicate with devices in a natural way, we need speech recognition.

This, of course, makes it necessary to have high accuracy, fast speed, and the ability to recognize many different speakers. The use of speech recognition is increasing rapidly and is now available in smart TVs, desktop computers, every new smartphone, etc., allowing us to talk to computers naturally. With its use in home appliances, education, and even surgical procedures, accuracy and speed become very important. A speech recognition (SR) system can basically be either speaker-dependent or speaker-independent. A speaker-dependent system is intended to be used by a single speaker and is therefore trained to understand one particular speech pattern. A speaker-independent system is intended for use by any speaker and is naturally more difficult to achieve. Such systems tend to have 3 to 5 times higher error rates than speaker-dependent systems.

To understand SR, one should understand the components of human speech. A phoneme is defined as the smallest unit of speech that distinguishes meaning. Every language has a set number of phonemes, which sound different depending on accents, dialects, and physiology. In SR, phonemes can be considered in their acoustic context, which changes how they sound: when the phoneme to the left or right of the phoneme being interpreted is also considered, the units are called biphones, and when both the left and right contexts are considered, they are called triphones. Continuous speech is complicated because, as a particular articulatory gesture is being produced, the next one is already being anticipated, changing the sound. This phenomenon, the smearing of sounds into one another, is called co-articulation. Human speech also has variations in pitch, rhythm, and intensity. Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. Rudimentary speech recognition software has a limited vocabulary of words and phrases, and it may only identify these if they are spoken very clearly. More sophisticated software can accept natural speech.

Speaker dependence vs. independence

A speaker-dependent system is, depending on training and speaker, usually more accurate than a speaker-independent system. There are also multi-speaker systems, intended to be used by a small group of people, and speaker-adaptive systems, which learn to understand any speaker given a small amount of speech data for training.

Isolated, discontinuous, or continuous speech

Isolated speech, meaning single words, and discontinuous speech, meaning full sentences with words artificially separated by silence, are the easiest to recognize since the word boundaries are detectable. Continuous speech is the most difficult to recognize because of co-articulation and unclear boundaries, but it is the most interesting since it allows us to speak naturally.

Task and language constraints

The constraints can be task-dependent, accepting only sentences relevant to the task, e.g. a ticket-purchase service rejecting “The car is blue”. Others can be semantic, rejecting “The car is sad”, or syntactic, rejecting “Car sad the is”. Constraints are represented by a grammar that filters out unreasonable sentences, and their strength is measured by perplexity, a number representing the grammar’s branching factor, i.e. the number of words that can follow a specific word. For example, a grammar in which every word can be followed by any of ten equally likely words has a perplexity of roughly ten.

The speech recognition process

The common method used in automatic speech recognition systems is the probabilistic approach, computing a score for matching spoken words with a speech signal. A speech signal corresponds to any word or sequence of words in the vocabulary with a probability value. The score is calculated from phonemes in the acoustic model, knowing which words can follow other words through linguistic knowledge. The word sequence with the highest score is chosen as the recognition result. The SR process can be divided into four consecutive steps: pre-processing, feature extraction, decoding, and post-processing.

[Figure: Block diagram of a speech recognizer.]

Different SR systems implement each of these steps, and the interfaces between them, differently; the basic pipeline described below is the one selected for this project.



Pre-processing

The wake-up-word (WUW) recognition system follows the generic functions depicted in the block diagram. The speech signal captured by the microphone is converted into an electrical signal that is digitized prior to being processed by the WUW recognition system. The system can also read a digitized raw waveform stored in a file. In either case, raw waveform samples are converted into feature vectors by the front-end at a rate of 100 feature vectors per second, defining the frame rate of the system. Those feature vectors are used by the Voice Activity Detector (VAD) to classify each frame (i.e., feature vector) as containing speech or no speech, defining the VAD state. The VAD state is useful for reducing the computational load of the recognition engine contained in the back-end. The back-end reports a recognition score for each token (e.g., word) matched against a WUW model. Speech is recorded with a sampling frequency of, for example, 16 kHz; according to the Shannon sampling theorem, a band-limited signal can be reconstructed if the sampling frequency is more than double its maximum frequency, meaning that frequencies up to almost 8 kHz are represented correctly.
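To make the framing step concrete, the following is a minimal MATLAB sketch. The 25 ms window and 10 ms hop (which yield the 100 frames per second mentioned above) are standard textbook choices assumed here, and the file name speech.wav is a placeholder.

    % Framing sketch: 25 ms windows with a 10 ms hop -> 100 frames/second.
    % 'speech.wav' is a placeholder file name.
    [x, fs] = audioread('speech.wav');    % fs is expected to be 16000 Hz
    x = mean(x, 2);                       % mix down to mono if stereo

    winLen = round(0.025 * fs);           % 400 samples at 16 kHz
    hopLen = round(0.010 * fs);           % 160 samples between frame starts
    nFrames = floor((length(x) - winLen) / hopLen) + 1;

    frames = zeros(winLen, nFrames);
    for k = 1:nFrames
        idx = (k-1)*hopLen + (1:winLen);
        frames(:, k) = x(idx);            % one column per analysis frame
    end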

Input to the system can be provided via a microphone (live input) or through a pre-digitized sound file. In either case, the resulting input to the feature extraction unit, depicted as the front-end, is digital sound. Feature extraction is a procedure that concentrates information from the voice signal that is unique for every speaker. It is accomplished using the standard algorithm for Mel Frequency Cepstral Coefficients (MFCC). Features are used for recognition only when the VAD state is on. The result of feature extraction is a small number of coefficients that are passed on to the pattern-matching stage.

[Figure: Block diagram of a WUW speech recognizer.]



Signal Decoding

The decoding process is where calculations are made to find the sequence of words that is the most probable match to the feature vectors. For this step to work, three things have to be present: an acoustic model with a hidden Markov model (HMM) for each unit (phoneme or word), a dictionary containing the possible words and their phoneme sequences, and a language model with the likelihoods of words or word sequences. The purpose of the voice activity detector (VAD) is to reliably detect the presence or absence of speech. This tells the front-end, and correspondingly the back-end, when and when not to process speech. This is typically done by measuring the signal energy at any given moment. When the signal energy is very low, it suggests that no word is being spoken. If the signal energy spikes and stays at a high level for a considerable period of time, a word is most likely being spoken. Therefore, the VAD searches for extreme changes in the signal energy, and if the energy stays high for a certain amount of time, the VAD goes back and marks the point at which the energy changed dramatically.
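A minimal MATLAB sketch of this energy-based VAD idea follows. The 30 dB threshold and the 10-frame minimum duration are illustrative assumptions, not values prescribed by this report; frames is the matrix produced by the framing sketch above.

    % Energy-based VAD sketch. Threshold and duration are assumptions.
    energy = sum(frames.^2, 1);              % short-time energy per frame
    logE   = 10*log10(energy + eps);         % in dB; eps avoids log(0)

    thresh = max(logE) - 30;                 % assumed: 30 dB below the peak
    minRun = 10;                             % assumed: >= 10 frames (100 ms)

    active = logE > thresh;                  % frame-level speech/no-speech
    % Keep only runs of active frames that last at least minRun frames,
    % and mark the frame where each run (i.e., each word) begins.
    d = diff([0 active 0]);
    starts = find(d == 1);  ends = find(d == -1) - 1;
    keep = (ends - starts + 1) >= minRun;
    speechStarts = starts(keep);             % frames where speech begins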

Post Processing

In the post-processing step, SR systems usually attempt to re-score the list of hypotheses, e.g. by using a higher-order language model and/or pronunciation models. The simplest way to recognize a delineated word token is to compare it against a number of stored word templates and determine which model gives the “best match”. This goal is complicated by a number of factors. First, different samples of a given word will have somewhat different durations. This problem can be eliminated by simply normalizing the templates and the unknown speech so that they all have an equal duration. However, another problem is that the rate of speech may not be constant throughout the utterance (e.g., the word); in other words, the optimal alignment between a template (model) and the speech sample may be nonlinear. The Dynamic Time Warping (DTW) algorithm handles this by making a single pass through a matrix of frame scores while computing locally optimized segments of the global alignment path.
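The following is a minimal MATLAB sketch of DTW over two feature matrices (one column per frame), using Euclidean frame distances and the common symmetric step pattern. It is an illustrative implementation under those assumptions, not necessarily the exact variant used in the project; it would be saved as dtw_distance.m.

    % DTW sketch: T and X are d-by-nT and d-by-nX feature matrices.
    function cost = dtw_distance(T, X)
        nT = size(T, 2);  nX = size(X, 2);
        % Local frame-to-frame Euclidean distances.
        local = zeros(nT, nX);
        for i = 1:nT
            for j = 1:nX
                local(i, j) = norm(T(:, i) - X(:, j));
            end
        end
        % Accumulated cost with the symmetric step pattern
        % (diagonal match, insertion, deletion).
        D = inf(nT + 1, nX + 1);
        D(1, 1) = 0;
        for i = 1:nT
            for j = 1:nX
                D(i+1, j+1) = local(i, j) + min([D(i, j), D(i, j+1), D(i+1, j)]);
            end
        end
        cost = D(nT+1, nX+1) / (nT + nX);   % length-normalized alignment cost
    end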



Mel Frequency Cepstral Coefficient (MFCC)

Feature extraction is the most important part of the whole system. The aim of feature extraction is to reduce the data size of the speech signal before pattern classification or recognition. The steps of Mel Frequency Cepstral Coefficient (MFCC) calculation are: framing, windowing, Discrete Fourier Transform (DFT), Mel frequency filtering, logarithmic function, and Discrete Cosine Transform (DCT).

The Discrete Fourier Transform (DFT) is computed with the Fast Fourier Transform (FFT) algorithm. The FFT converts each frame of N samples from the time domain into the frequency domain, where the calculation is more precise than in the time domain.

Mel frequency filtering: The voice signal does not follow a linear frequency scale, and the frequency range of the FFT is very wide. The Mel scale is a perceptual scale that helps simulate the way the human ear works; it corresponds to better resolution at low frequencies and less at high frequencies.

Logarithmic function: A logarithmic transformation is applied to the absolute magnitude of the coefficients obtained after Mel-scale conversion. The absolute magnitude operation discards the phase information, making feature extraction less sensitive to speaker-dependent variations.

DCT: The Discrete Cosine Transform (DCT) converts the log Mel-filtered spectrum back into a time-like (cepstral) domain; the resulting Mel Frequency Cepstral Coefficients are the features used in the recognition stage.
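To make the pipeline concrete, below is a minimal MATLAB sketch of the MFCC steps listed above, reusing the frames matrix and fs from the framing sketch. The filter count (26), coefficient count (13), FFT size (512), and Mel-scale formulas are common textbook choices assumed here, not parameters taken from the project.

    % MFCC sketch: frames is winLen-by-nFrames, fs is the sample rate.
    % Assumed textbook parameters: 26 Mel filters, 13 coefficients.
    N    = size(frames, 1);
    nfft = 512;  nMel = 26;  nCep = 13;

    % Windowing (Hamming window computed explicitly), then DFT via FFT.
    w = 0.54 - 0.46*cos(2*pi*(0:N-1)'/(N-1));
    spec = abs(fft(frames .* w, nfft));        % magnitude spectrum
    spec = spec(1:nfft/2+1, :);                % keep non-negative frequencies

    % Triangular Mel filterbank between 0 Hz and fs/2.
    mel  = @(f) 2595*log10(1 + f/700);         % Hz -> Mel
    imel = @(m) 700*(10.^(m/2595) - 1);        % Mel -> Hz
    edges = imel(linspace(mel(0), mel(fs/2), nMel+2));
    bins  = floor(edges/fs*nfft) + 1;          % FFT bin of each filter edge
    fbank = zeros(nMel, nfft/2+1);
    for m = 1:nMel
        fbank(m, bins(m):bins(m+1))   = linspace(0, 1, bins(m+1)-bins(m)+1);
        fbank(m, bins(m+1):bins(m+2)) = linspace(1, 0, bins(m+2)-bins(m+1)+1);
    end

    % Log of the Mel filter energies, then DCT-II to get the coefficients.
    logMel = log(fbank * spec + eps);
    Cmat = cos(pi/nMel * ((0:nCep-1)') * ((1:nMel) - 0.5));
    mfcc = Cmat * logMel;                      % nCep-by-nFrames features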

Hidden Markov Model Recognizer

For recognition or classification of the speech signal, there are many approaches to recognizing a test audio file. Speech recognition methodologies include ANN, GMM, DTW, HMM, fuzzy logic, and various other methods. Among them, HMM techniques are more widely used than any of the others. There are four types of HMM model used in speech processing.
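As an illustration of how an HMM scores a word, the following is a minimal MATLAB sketch of the forward algorithm for a discrete-observation HMM. The model matrices here are hypothetical; a real recognizer would typically use Gaussian emission densities over MFCC vectors rather than a discrete emission table.

    % Forward-algorithm sketch for a discrete HMM (hypothetical model).
    % A(i,j)  : transition probability from state i to state j.
    % B(i,o)  : probability that state i emits observation symbol o.
    % piVec(i): initial state probability; obs is a vector of symbol indices.
    function p = hmm_forward(A, B, piVec, obs)
        nStates = size(A, 1);  T = numel(obs);
        alpha = zeros(nStates, T);
        alpha(:, 1) = piVec(:) .* B(:, obs(1));                  % initialization
        for t = 2:T
            alpha(:, t) = (A' * alpha(:, t-1)) .* B(:, obs(t));  % induction
        end
        p = sum(alpha(:, T));   % likelihood of the observation sequence
    end

In recognition, each word would have its own HMM, and the word whose model gives the highest likelihood p would be chosen as the result.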



2. Tool Selection

For this project, MATLAB was selected to integrate all of the functional components of interest into a unified testing environment. MATLAB was chosen for its ability to quickly implement complex mathematical and algorithmic functions, as well as its ability to display results visually through image plots and other graphs. We were also able to develop a GUI in MATLAB to use as the command-and-control interface for all of our test components. At the core of our testing environment is the back-end pattern-matching algorithm. One of the goals of the presented testing environment is to research the effectiveness of the back-end algorithm, more specifically an implementation of Dynamic Time Warping (DTW). The algorithm is used to perform speech recognition of a series of words against a speech model.
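A minimal MATLAB driver sketch tying together the MFCC and DTW sketches from the previous sections is shown below. The file names, the word list, and the mfcc_features helper (assumed to wrap the framing and MFCC steps above) are hypothetical placeholders, not the project's actual file layout.

    % Recognition driver sketch. File names and mfcc_features are hypothetical.
    words = {'yes', 'no', 'stop', 'go'};
    templates = cell(size(words));
    for k = 1:numel(words)
        [x, fs] = audioread([words{k} '_template.wav']);
        templates{k} = mfcc_features(x, fs);   % d-by-nFrames MFCC matrix
    end

    [x, fs] = audioread('test_word.wav');
    test = mfcc_features(x, fs);

    costs = zeros(1, numel(words));
    for k = 1:numel(words)
        costs(k) = dtw_distance(templates{k}, test);  % DTW sketch above
    end
    [~, best] = min(costs);                    % lowest alignment cost wins
    fprintf('Recognized word: %s\n', words{best});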




