
Acknowledgment
We would like to thank our supervisor Mr. Kamel KHENISSI for his valuable support throughout this work, without forgetting to thank Ms. Wiem FRADI for the linguistic revision of our report.

Ahmed BAHRI Moemen MANSOURI

Abstract
"The voice is the technology of tomorrow" is the claim of specialists at giants such as Microsoft and IBM. In North America, which is decades ahead of the rest of the world, speech technology is becoming the most natural mode of interaction with the machine: Windows 7, the flagship product of Microsoft, is an excellent example. The maturity of speech synthesis and voice recognition technology has led researchers toward the realization of an old dream: machine understanding of spontaneous speech. The work developed during this project consists in creating an application to manipulate the MySQL database utility vocally. The project focused on finding vocabularies that allow the manipulation of the MySQL database, as well as the vocal manipulation of Skype and Mozilla Firefox. For automatic speech recognition, we used a specific configuration of the framework adopted in this project with the aim of obtaining the best result.

Table of contents

List of figures

List of tables

Glossary

GENERAL INTRODUCTION
Speech recognition is a technology that allows computer software to interpret natural human language in order to control a well-defined system.

Early research in automatic speech recognition began about 40 years ago in the U.S., during the Cold War, with the first attempts to create a machine capable of understanding human speech in order to interpret intercepted Russian messages. Since then the development of speech recognition has continued to evolve, taking on great importance as it became widely used by:

- Large firms, for some of their internal applications or in commercial applications based on speech recognition (Dragon NaturallySpeaking, etc.). These applications generally use their own speech engine, and there are also companies that specialize in creating and selling such engines, for example Voxalead (still experimental).
- People with disabilities, by allowing them greater autonomy.

Speech recognition is also linked to many fields of science (natural language processing, linguistics, formal language theory, information theory, signal processing, neural networks, artificial intelligence, etc.). In fact, this technology today represents a potential market in the world of software, because speech recognition and the PC have become an indispensable means of intellectual and social development. As part of our End of Year Project, we decided to create a speech recognition system for controlling the MySQL database management system, in order to facilitate its manipulation for developing other projects. To design a system of automatic speech recognition (ASR) that is as correct as possible, one should:

- First, understand how complex the speech signal really is, i.e. know the object or observation given as input.
- Second, define the task of the system properly, i.e. its constraints and expected performance.

We briefly present the project, then we expose the problems through a study of the existing systems, then we present the different needs and the improvements required for the current system, and define the various specifications (platform, tools, ...). Finally we propose a system that we deem appropriate.

CHAPTER I: PRESENTATION OF THE PROJECT

I- Context of the project

Nowadays voice technology is spreading across different operating systems, and the need for this technology keeps growing every day. Within the context of our project, we applied this technology to the MySQL database utility with the aim of understanding how a vocal application works. This project was created using the local resources of the Private Higher School of Engineering and Technology "ESPRIT".

II- The choice of methodology


To carry out the project properly, it is essential to establish a process that helps formalize the preliminary stages of developing a system, so as to make this development more faithful to the client's needs. Given the number of available methods (2TUP, RUP, agile methods, ...), the choice becomes difficult; a project manager leading a startup project is asked: How will I organize the development teams? Which tasks are assigned to whom? How long will it take to deliver the product? How do we involve the client in the development to capture his needs?

The following table shows the advantages and disadvantages of each methodology.

Table 1: Comparative table of design methodologies

Justification of our choice:


Given that our project requires a well-defined development process that determines the functional needs expected of the system, through to the final design and coding, 2TUP appeared the most appropriate method to lead and plan the sequence of stages of this project. The Two Tracks Unified Process responds to the constraints of continual change imposed on the company's information systems.

III- Introduction to the 2TUP methodology:


2TUP is the abbreviation of "Two Track Unified Process". It is a process that meets the requirements of the Unified Process. The 2TUP process responds to the constraints of continual change imposed on the company's information systems. In this sense, it strengthens control over the evolution and correction of such systems. "Two Track" literally means that the process follows two paths or branches. These are the functional branch and the technical architecture branch, which correspond to the two axes of change imposed on the information system.

Figure 1: Two types of constraints imposed on the information system

1- The functional branch:


This branch capitalizes the knowledge of the company's business. It generally constitutes an investment for the medium and long term. The functions of the information system are in fact independent of the technologies used. This branch includes the following steps: 1- The capture of functional needs, producing a model focused on the needs of the business users. 2- Functional analysis.

2- The technical branch:


This branch capitalizes the know-how. It is an investment for the short and medium term. The techniques developed for the system can in effect be independent of the functions to be performed. This branch includes the following steps: 1- Capture of the technical needs. 2- Generic design.

3- The middle branch


Following the developments of the functional model and the technical architecture, the implementation of the system consists in merging the results of the two branches. This merger gives the development process its Y shape. This part includes the following steps: 1. Preliminary design. 2. Detailed design. 3. Coding. 4. Integration.

Figure 2: Development Process in Y


CHAPTER II: THE FUNCTIONAL PART


I-Preliminary study

Figure3: Preliminary Study Schema

As the diagram above shows, the preliminary study is the first step of 2TUP. It consists in performing an initial identification of the functional and operational needs, mainly using text. It prepares the more formal activities of capturing the functional needs and capturing the technical needs. For our project this study was achieved through the development of a specification. We examined the various systems already on the market and tried to identify their positive and negative sides in the critical part, in order to fix our main objectives, articulate the needs and secure the modules that we will maintain or improve later. The last stage of this study is the modeling of a context diagram.

Figure 4: Functional Schema

Description of the schema:
1. The speaker emits a sentence; the sound is captured by a microphone.
2. The voice signal is then digitized using an analog-to-digital converter. The parameterization of the signal provides a fingerprint.
3. The decoding consists in describing the acoustic signal in terms of linguistic units. It aims to segment the signal; the identification of the different segments is based on phonetic and linguistic constraints.

Once the analysis process is completed, the recognition phase begins. In fact all the spoken words are separated by silences of a duration greater than a few tenths of a second. The recognition phase consists mainly of two phases:

1) The learning phase: the speaker pronounces the whole vocabulary, often several times, to create a reference dictionary.
2) The recognition phase: the speaker utters a word. To recognize the words emitted by the speaker there are three parts:
- First, the sensor: to capture the physical phenomenon; in our case it is the microphone. A signal is transmitted to the microphone when the speaker speaks.
- Second, the parameterization of forms, which gives us a fingerprint, that is to say the characteristics of the sound (time / frequency / intensity).
- And finally, the identification of forms.

A second schema is needed to better grasp all the different use cases that should be treated.

Figure 5: Operating principle of speech recognition

This diagram shows the operating principle of recognition: a speaker pronounces a word of the vocabulary; word recognition is then a typical problem of pattern recognition. Any pattern recognition system always involves the following three parts:
- A sensor to capture the physical phenomenon under consideration (in our case a microphone).
- A parameterization stage for the forms (e.g. a spectrum analyzer).
- A decision stage in charge of classifying an unknown form into one of the possible categories.
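The three parts described above can be sketched in Python. This is a minimal illustration, not the system implemented in this project: the function names, the energy-based features and the reference vectors are all invented for the example.

```python
# Illustrative sketch of the three parts of a pattern-recognition system:
# sensor -> parameterization -> decision (all names and values are invented).
import math

def sensor(samples):
    """Part 1 - capture: pass through the raw samples a microphone delivers."""
    return list(samples)

def parameterize(samples, window=4):
    """Part 2 - parameterization: summarize the signal as a small feature
    vector (here, the energy of fixed-size windows)."""
    return [sum(x * x for x in samples[i:i + window])
            for i in range(0, len(samples), window)]

def classify(features, references):
    """Part 3 - decision: pick the reference whose feature vector is closest
    (Euclidean distance) to the unknown form."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(references, key=lambda name: dist(features, references[name]))

signal = sensor([0.1, 0.9, 0.8, 0.1, 0.0, 0.1, 0.1, 0.0])
feats = parameterize(signal)
refs = {"yes": [1.4, 0.02], "no": [0.1, 1.2]}
print(classify(feats, refs))  # prints "yes": the closest reference
```

In a real recognizer the decision stage is of course far more elaborate (HMMs, as presented in Chapter III), but the three-part structure is the same.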

II- Capture of the Functional needs


Once the preliminary study is done we move to the next step, in which we will determine the functional needs on the left branch, in parallel with the technical needs on the right branch.

1-Functional Needs:
This project consists in designing and implementing a tool for the voice manipulation of the database. First we will start by defining the actors who will interact with the system. Considering the needs of our application, it appears that the main actors are reduced to an administrator and a user. The administrator is responsible for database creation and for the maintenance of the accounts of the individual users of the database; all these tasks must be completed using only his voice. The user can access the database through his natural voice and, after authentication, can issue DDL requests.

Figure 6: Capture of the functional needs


1.1-The use case diagram:


The use case diagram reflects the principle of the overall functioning of our application and the various actions of the actors. The study of the needs of the actors who interact with our system requires the development of the following use cases:

Figure 7: Use case diagram

We consider that two users are possible in our system: the administrator and the normal user. The administrator has access to all the existing use cases, including the manipulation of the database, whereas the user can only manipulate the database once it is created. a) Voice authentication: the user pronounces his login and password to authenticate and access the main interface of MySQL. The following table details the process of authentication.

Title: Authentication
Intention: Authentication of the users.
Actors: Users.
Preconditions: MySQL available.
Start when: The application is launched.
Definition of transitions: Pronounce the login and password.
Finish when: The administrator or the user validates the session and connects.
Exception(s): Invalid user name. Invalid password. MySQL not found.
Postconditions: MySQL menu.

Table 2: nominal scenario of the use case "Voice Authentication"

b) Manipulate the database vocally:
Title: Manipulating the database.
Summary: The user can create, modify or delete one or more databases and create and execute DML queries vocally.
Actors: the application user.

The following table details the process of the vocal manipulation of the database.

Title: Vocal manipulation of the database.
Intention: Creating and manipulating a database in MySQL.
Actors: User of the application.
Preconditions: Authentication succeeded.
Start when: The main window of MySQL opens.
Definition of transitions:
CASE 1: The user wants to create a database:
- Say "create new schema".
- Say the name of the database to create.
- Confirm the selection.
CASE 2: The user wants to delete a database:
- Say the name of the database to drop.
- Say "drop database".
- Confirm the selection.
CASE 3: The user wants to create a new table in the database:
- Select the database.
- Say "create new table".
- Say the name of the table to create.
- Confirm the selection.
CASE 4: The user wants to create a DML query:
- Say the name of the table.
- Dictate the query to create.
- Confirm the selection.
CASE 5: The user wants to execute a DML query:
- Say the name of the table.
- Say the word "execute".
NB: for this case the user must write the DML query.
Finish when: The user confirms his choice.
Exception(s): The name of the database or table already exists. Syntax error in SQL.

Table 3: nominal scenario of the use case "Vocal Manipulation"

c) Manage users vocally:
Title: Managing the users.
Summary: The administrator can create, modify or delete a user account.
Actors: the application administrator.

The following table details the process of managing the users.

Title: Manage the users.
Intention: Creation, modification or deletion of a user account.
Actors: Administrator.
Preconditions: Authentication as administrator succeeded.
Start when: The MySQL Administrator interface opens.
Definition of transitions:
CASE 1: The administrator wants to create a new user:
- Say "user administration".
- Say "add new user".
- Say the name and password.
- Say "apply changes".
CASE 2: The administrator wants to delete a user:
- Say "user administration".
- Say the name of the user.
- Say "drop user".
- Say "ok".
CASE 3: The administrator wants to create a clone of a user:
- Say "user administration".
- Say the name of the user.
- Say "clone user".
- Give the name and password of the new user.
- Say "ok".
Finish when: The administrator completes; the session disconnects.
Exception(s): Invalid user name. Invalid password.

Table 4: nominal scenario of the use case "Manage user"

1.2- The Activity diagram


1. The user pronounces a sentence through a microphone.
2. The voice signal is then analyzed using the model to obtain an acoustic signal.
3. The decoding consists in describing the acoustic signal in terms of linguistic units; it aims to segment the signal.
4. The segmented signal is compared with the database (dictionary) thanks to the search graph.
5. The action is projected on screen.

Figure 8: Activity Diagram


2 - Non-Functional Needs:
Besides the functional needs developed above, we must consider the following constraints:
- The service quality of the application.
- The ergonomics of the application: the interfaces of our application must be clear.
- The response time of the application should be minimal.

III-Functional Analysis

Figure 9: The functional analysis

1 - Cutting into categories


It consists of:
1) Dividing the candidate classes into categories.
2) Elaborating preliminary class diagrams by category.
3) Deciding the dependencies between categories.

Figure 10: Cutting into categories

1.1- The packages Diagram:


The package diagram is a graphical representation of the relationships between the packages of the speech recognition system.

Figure 11: Packaging Diagram

The most general package is "general media treatment", decomposed into two packages: "Sound treatment" and "Speech treatment".
- Sound is the wave which is audible to the human ear.
- Speech is the process of stretching and relaxing the vocal cords to produce sound.
The package we are interested in is "Speech treatment", which in turn is divided into two sub-packages: "Speech synthesis" and "Speech recognition". Our development is based on "Speech recognition".

2 - Development of the static model

Figure 12: Development of the static model

2.1-Diagram of Classes:
The different classes are:
- Caller (Appelant)
- Instruction
- Language Instruction
- Recording (Enregistrement)
- Speech Recognizer (Reconnaissance Vocale)
- Feature (Caractéristique)
- Feature Extraction (Extraction de Caractéristique)
- Feature Classification (Classification de Caractéristique)
- Feature Matching (Correspondance de Caractéristique)
- Code Book (Dictionnaire)
- Action

The various relationships are:
- Listen
- Record
- Send Speech Signal
- Perform
- Search And Match
- Contain

Figure 13: Model participating Class Diagram


2.2-Description of class diagram


The class "Caller" has a relationship "Listen" with the class "Instruction": the caller can listen to a type of instruction, which is a "Language Instruction". The class "Caller" is also associated through "Record" with the class "Recording". This class is then associated through "Send Speech Signal" with the class "Speech Recognizer". The class "Speech Recognizer" is associated with the class "Feature" through the relationship "Perform", which means that the class "Speech Recognizer" uses the class "Feature" for feature extraction, feature classification and feature matching. Finally, the class "Feature Matching" is associated through the relationship "Search And Match" with the class "Code Book" to match the input speech, and the "Code Book" is associated with the class "Action" through the relation "Contain".

Note: to ensure the clarity of the diagram we preferred not to show the attributes and methods of the classes. We defined the physical model needed, consisting of 3 classes that facilitate the implementation of our application.
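The static model above can be sketched in Python. The class and relation names follow the diagram, but the attributes and method bodies are purely illustrative assumptions (the report deliberately omits them), not the project's actual implementation.

```python
# Hypothetical sketch of the class diagram: Caller listens to instructions,
# a Recording sends the speech signal to the SpeechRecognizer, which matches
# features against the CodeBook, which contains Actions. Bodies are toys.
class Action:
    def __init__(self, name):
        self.name = name

class CodeBook:
    """Dictionary of reference features; 'Contain' relation toward Action."""
    def __init__(self):
        self.entries = {}                       # feature tuple -> Action
    def search_and_match(self, feature):
        return self.entries.get(tuple(feature))

class SpeechRecognizer:
    """'Perform' relation: extraction, classification and matching."""
    def __init__(self, code_book):
        self.code_book = code_book
    def extract(self, signal):
        return [round(sum(signal), 1)]          # toy feature extraction
    def recognize(self, signal):
        feature = self.extract(signal)          # Feature Extraction
        return self.code_book.search_and_match(feature)  # Feature Matching

class Recording:
    """'Send Speech Signal' relation toward the recognizer."""
    def __init__(self, signal):
        self.signal = signal
    def send_speech_signal(self, recognizer):
        return recognizer.recognize(self.signal)

book = CodeBook()
book.entries[(1.0,)] = Action("open schema")
rec = Recording([0.2, 0.3, 0.5])
action = rec.send_speech_signal(SpeechRecognizer(book))
print(action.name)  # prints "open schema"
```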

3 - Development of dynamic model:


The development of the dynamic model is the third activity of the analysis stage. It is situated on the left branch of the Y cycle. This is an iterative activity, strongly coupled with the static modeling activity described above. The development of the dynamic model precedes the preliminary design.


Figure 13: The dynamic model

3.1- Sequence diagram:


The sequence diagram is mainly used to show the interactions between the classes / objects listed in the previous section, in the sequential order in which these interactions take place. The figure shows the sequence diagram of the system, including the classes / objects, lifelines, processes and interactions. The interactions between the classes / objects are numbered 1 through 11 sequentially, which indicates which process must be completed before the following process starts.

Figure 14: object sequence diagram of a speech recognition

Note: the nominal scenario of voice recognition is represented in detail in the sequence diagram above (see Figure 14). In the following sequence diagrams we chose to group the class instances related to recognition into a single participant, the "Recognition system", in accordance with the package diagram described earlier (see Figure 11).

[Sequence diagram: the administrator pronounces his login and password; the recognition system verifies the grammar containing the login and password. If the grammar is valid, the login and password are inserted into the text fields and verified against the MySQL database; if the connection parameters are valid, the MySQL administrator interface is displayed, otherwise the connection parameters are requested again.]
Figure 15: sequence diagram representing the connection


[Sequence diagram: after connection, the administrator pronounces the user privilege control command; the recognition system verifies the grammar. If the grammar is valid, the mouse is manipulated to insert the privilege and the privilege is added through the database utility; otherwise the system asks the administrator to pronounce the command again.]

Figure 16: sequence diagram representing the assignment of a privilege to a user


[Sequence diagram: after connection, the administrator pronounces the user creation command; the recognition system verifies the grammar. If the grammar is valid, the mouse is manipulated to insert the new user and the new user information interface opens. The administrator then pronounces the login information which, if its grammar is valid, is inserted into the text fields and attributed to the new user; otherwise the system asks the administrator to pronounce the command again.]

Figure 17: sequence diagram for the addition of a user


3.2- Diagram of state transitions


Now that the scenarios have been formalized, the knowledge of all the interactions between objects allows representing the business rules of the system dynamics. However, in order to develop some of these dynamic rules, one should focus on the classes with the richest behavior. One uses for this the concept of a finite state machine, which consists in tracking the life cycle of a generic object of a particular class through its interactions with the rest of the world, in all possible cases. This local view of an object, describing how it reacts to events depending on its current state and moves into a new state, is plotted as a state diagram.

Figure 18: diagram of state transitions


4-Confrontation between the static and the dynamic models:

There are various relationships between the main concepts of the static model (object, class, association, attribute and operation) and the main dynamic concepts (message, event, state and activity). The matches are far from trivial, because these are complementary points of view, not redundant ones. Let us try to synthesize the most important ones, without being exhaustive:
- A message can be the invocation of an operation on an object (the receiver) by another object (the sender).
- An event, or the effect on a transition, may correspond to the call of an operation.
- An activity in a state may correspond to the execution of a complex operation or a series of operations.
- An interaction diagram involves objects (or roles).
- An operation can be described by an interaction or activity diagram.
- A guard condition and a change event can consult attributes or static links.
- An effect on a transition can modify attributes or static links.
- The parameter of a message can be an attribute or an entire object.

CHAPTER III: THE TECHNICAL PART


I-Capture of the technical requirements:
The capture of technical requirements identifies all the constraints on the choices for dimensioning and designing the system. The tools and equipment are selected while taking into account the constraints of integration with the existing environment (prerequisites of the technical architecture).

Figure 19: Capture of the technical requirements

Part of the work consisted in studying the functioning of speech recognition systems, in order to then develop an acoustic model allowing the recognition of words. This is why we propose, as a first step, to introduce the Hidden Markov Models, a mathematical concept that will allow us to discuss the layout of automatic speech recognition (ASR) systems. In a second step, we will apply this model to our project.

1-The Hidden Markov Models:



Definition: A Markov process is a discrete-time system which is always in a state taken from N distinct states. The transitions between states occur between two consecutive discrete instants, according to some probability law. The probability of each state depends only on the state that immediately precedes it. A hidden Markov model (HMM) represents, in the same way as a Markov chain, a sequence of observations in which the state of each observation is not observed but is associated with a probability density function. It is therefore a stochastic process in which the observations are a random function of the state, and in which the state changes at every instant according to the transition probabilities from the previous state.

Figure 20: The Markov model

More formally, a hidden Markov state machine is characterized by the quadruple described below:
- S_i: the state i.
- π_i: the probability that S_i is the initial state.
- a_ij: the probability of the transition from S_i to S_j.
- b_i(k): the probability of emitting the symbol k while being in the state S_i.

On condition that:
- The sum of the probabilities of the initial states is equal to 1: Σ_i π_i = 1.
- The sum of the probabilities of the transitions from a state is equal to 1: Σ_j a_ij = 1.
- The sum of the probabilities of the outputs from a state is equal to 1: Σ_k b_i(k) = 1.

We can describe a hidden Markov model as the parameter set λ = (π, A, B) with:
- π the set of the initial probabilities.
- A the set of transition probabilities between states.
- B the set of laws (or densities) of probability associated with a state.
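As a sketch, the parameter set λ = (π, A, B) and the three normalization conditions above can be written directly in Python. The numbers are arbitrary illustrative values for a 2-state HMM emitting the symbols {0, 1}.

```python
# Sketch of the parameter set λ = (π, A, B) for a toy 2-state HMM,
# checking the three normalization conditions stated in the text.
pi = [0.6, 0.4]                     # π_i: initial-state probabilities
A = [[0.7, 0.3],                    # a_ij: transition from S_i to S_j
     [0.4, 0.6]]
B = [[0.9, 0.1],                    # b_i(k): emission of symbol k in S_i
     [0.2, 0.8]]

assert abs(sum(pi) - 1.0) < 1e-9                      # Σ_i π_i = 1
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)   # Σ_j a_ij = 1
assert all(abs(sum(row) - 1.0) < 1e-9 for row in B)   # Σ_k b_i(k) = 1
print("lambda = (pi, A, B) is a valid HMM parameterization")
```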

2-The Voice recognition theory:


Whether we use speech recognition to dial a phone number, browse through the windows of our computer, enter data into software or dictate a letter in a word processor, the basic problem remains the same: identify the meaning of a flow of words uttered, often against more or less significant background noise. This task is made difficult not only by the deformations induced by the use of a microphone, but also by a number of factors inherent to human language:
- Homophones, where the same sequence of sounds can correspond to several words (like the sound /sɑ̃/ in the French words "cent", "sans" and "sang").
- Local accents.
- Speech patterns (such as some elisions that make it difficult to separate the words: "j'vais l'chercher" for "je vais le chercher").
- The differences of speed between the users.
- The imperfections of a microphone, etc.
For the human ear, these factors do not usually represent difficulties. The brain copes with these deformations of speech by taking into consideration, almost unconsciously, nonverbal and contextual elements that allow it to eliminate ambiguities. It is only by taking into account these elements surrounding the sound itself that voice recognition software can achieve high degrees of reliability. Today the software products that give the best results are all based on a probabilistic approach. The aim of speech recognition is to reconstruct a sequence of words M from a recorded acoustic signal A.

In the statistical approach, we consider all the sequences of words M which could match the signal A. In this set of possible sequences we then choose the one which is most likely, that is to say the one that maximizes the probability P(M/A) that M is the correct interpretation of A, which is noted M* = argmax_M P(M/A).

Figure 21: reconstruction of a sequence of words M from a recorded acoustic signal A.

Note that P(A/B) represents the probability of the event A given that the event B has occurred. Bayes' rule computes the probability of the co-occurrence of two events A and B through the following equalities: P(A and B) = P(A/B) P(B) = P(B/A) P(A), where P(A) is the probability that the event A occurs. Thus, Bayes' rule allows us to rewrite the expression: P(M/A) = P(A/M) P(M) / P(A).

And as P(A) is a constant in the search for the best M, we finally have the equation: M* = argmax_M P(A/M) P(M).

This last equation is the key to the probabilistic approach to speech recognition. In fact, the first term P(A/M) represents the probability of observing the acoustic signal A if the sequence of words M was pronounced: it is a purely acoustic problem. The second term P(M) represents the probability that it is the sequence of words M that was pronounced: it is a linguistic problem. The above equation thus tells us that we can divide the problem of speech recognition into two independent parts: we will model the acoustic aspects and the language aspects separately.
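The division into an acoustic term and a linguistic term can be illustrated with a toy example. The hypotheses and all the probabilities below are invented for illustration; the decision rule simply picks the sequence M that maximizes the product P(A/M) · P(M).

```python
# Illustrative sketch of the decision rule M* = argmax_M P(A/M) * P(M),
# with invented acoustic and language scores for three hypotheses.
acoustic = {"create table": 0.30,   # P(A/M): the acoustic problem
            "create cable": 0.35,
            "crate table":  0.10}
language = {"create table": 0.60,   # P(M): the linguistic problem
            "create cable": 0.05,
            "crate table":  0.02}

best = max(acoustic, key=lambda m: acoustic[m] * language[m])
print(best)  # prints "create table": acoustically close but linguistically
             # unlikely hypotheses such as "create cable" are rejected
```

This is exactly the benefit of the decomposition: an acoustically plausible but linguistically absurd hypothesis loses to one that is slightly worse acoustically but far more probable as a word sequence.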

Thus, the transcription is divided into several modules: feature extraction produces A; the acoustic model computes P(A/M) and, from A, looks for the hypotheses M that are most likely associated with it; the language model computes P(M) to select one or more hypotheses on M depending on the language knowledge. The following schema illustrates the components of a transcription system.

Figure 22: The transcription system

3-Features extraction:
The sound signal to be analyzed comes in the form of a wave whose intensity varies over time. The first stage of the transcription process is to extract from it a series of numerical values that are sufficiently informative on the acoustic level to decode the signal thereafter.


The signal may contain areas of silence, noise or music. These areas are first removed in order to keep only the portions of the signal useful to the transcription, that is to say those corresponding to speech. The sound signal is then segmented into what are called breath groups, using sufficiently long silent pauses (about 0.3 s) as delimiters. The advantage of this segmentation is to obtain continuous sound segments of a reasonable size compared to the computing capabilities of the ASR system. Later in the transcription process, the analysis is done separately for every breath group. To track the changes in the signal, which generally varies rapidly over time, the breath group is itself divided into analysis windows of a few milliseconds (usually 20 or 30 ms). In order to avoid losing important information at the beginning or end of the windows, we make sure that they overlap, which leads to extracting features every 10 ms. From the signal contained in each analysis window, numerical values characterizing the human voice are computed. After this step, the signal becomes a sequence of so-called acoustic vectors, of dimension often greater than or equal to 39.
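The windowing just described can be sketched as follows. The 25 ms window and 10 ms shift follow the values given in the text; the 16 kHz sample rate is an assumption chosen for the example.

```python
# Sketch of framing a breath group into overlapping analysis windows:
# ~25 ms windows shifted every 10 ms (16 kHz sample rate is illustrative).
def frame_signal(samples, rate=16000, win_ms=25, hop_ms=10):
    win = int(rate * win_ms / 1000)   # 400 samples per window at 16 kHz
    hop = int(rate * hop_ms / 1000)   # 160 samples between window starts
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, hop)]

one_second = [0.0] * 16000            # one second of (silent) toy signal
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))    # prints "98 400"
```

Each of these frames would then be turned into one acoustic vector, which is why features are said to be extracted every 10 ms.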

4- The Acoustic Model:


The next step is to associate to the acoustic vectors, which are, as we have seen, numeric vectors, a set of word hypotheses (symbols). Referring to equation 1 of the statistical modeling, this amounts to estimating P(A/M). The techniques for calculating this value form what is called the acoustic model. The most used tool for the acoustic model is the Hidden Markov Model presented above. HMMs have indeed proven their effectiveness in practice for recognizing speech. Even if they have some limitations in modeling signal characteristics, such as the duration or length of successive acoustic observations, HMMs offer a well-defined mathematical framework to calculate the probabilities P(A/M). Acoustic models involve three levels of HMM, shown in the figure below.

Figure 23: The acoustic model

They look at first to recognize the types of sound, in other words to identify the phones (which sounds are pronounced by speakers and defined by specific characteristics). To do this, they model a phone by an HMM, usually 3 states representing the beginning, middle and the end. The hidden variable is then sub-phone and acoustic observations are acoustics vectors. To calculate the probabilities of observation in each state, two approaches are often considered, one based on the representation of probability densities by Gaussian el'autre based on neural networks. These different methods establish assumptions about the likelihood of phones uttered. However, the aim of acoustic models is to determine a sequence of words. Acoustic models for this purpose use a dictionary of pronunciations, making the correspondence between a word and pronunciations. As a word may be pronounced in different ways, according to his predecessor and his successor, or simply as the habits of the speaker, there may be multiple entries in the lexicon for the same word. The indications are given through the features of pronunciation phonemes.

The second level of HMMs models words, built from the first-level HMMs representing phones and from the pronunciation lexicon. It takes the form of a lexical tree initially containing all the words of the vocabulary, which is gradually pruned as phones are accepted. Since the first-level HMMs model phonemes rather than phones, the phonemes found in the pronunciation dictionary are converted into phones in order to recognize words; transformation rules that develop each phoneme according to its context are used for this. The third level finally models the sequence of words M in a breath group, and can thereby incorporate the knowledge provided by the language model about M. To build the HMM equivalent to a word graph, the HMM corresponding to the lexical tree is duplicated each time the acoustic model hypothesizes that a new word has been recognized. The functioning of the acoustic model just described faces a major problem: the search space of the higher-level HMM is often considerable, especially if the vocabulary is large and if the breath group to be analyzed contains many words. Dynamic programming algorithms can compute the probabilities efficiently: these are mainly the Viterbi algorithm and stack decoding, also called A* decoding. In addition, pruning is applied very regularly to keep only the hypotheses that appear most promising. The role of the acoustic model is thus to align the sound signal with word hypotheses using only acoustic cues. Its last level incorporates information about word sequences introduced by the language model.
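The Viterbi algorithm mentioned above can be illustrated with a toy example. The sketch below scores observation sequences against a 3-state left-to-right phone HMM of the kind used at the first level of the acoustic model; the transition and emission probabilities are made up for illustration, and a real system would score continuous acoustic vectors with Gaussian mixtures or neural networks rather than discrete symbols.

```java
// Minimal Viterbi sketch for a 3-state left-to-right phone HMM
// (begin, middle, end). Returns the log-probability of the best
// state path that explains the observations and ends in the final state.
public class Viterbi {
    static double best(double[][] trans, double[][] emit, int[] obs) {
        int n = trans.length;
        double[] v = new double[n];
        java.util.Arrays.fill(v, Double.NEGATIVE_INFINITY);
        v[0] = Math.log(emit[0][obs[0]]); // start in state 0 (phone beginning)
        for (int t = 1; t < obs.length; t++) {
            double[] nv = new double[n];
            java.util.Arrays.fill(nv, Double.NEGATIVE_INFINITY);
            for (int j = 0; j < n; j++)
                for (int i = 0; i < n; i++)
                    if (trans[i][j] > 0)
                        nv[j] = Math.max(nv[j],
                                v[i] + Math.log(trans[i][j]) + Math.log(emit[j][obs[t]]));
            v = nv;
        }
        return v[n - 1]; // best path ending in the final state (phone end)
    }

    public static void main(String[] args) {
        // Each state may loop on itself or advance to the next state.
        double[][] trans = {{0.5, 0.5, 0.0}, {0.0, 0.5, 0.5}, {0.0, 0.0, 1.0}};
        double[][] emit  = {{0.9, 0.1}, {0.2, 0.8}, {0.7, 0.3}};
        int[] obs = {0, 1, 1, 0}; // a short sequence of quantized observations
        System.out.printf("best log-prob = %.3f%n", best(trans, emit, obs));
    }
}
```

The pruning described in the text corresponds to discarding, at each time step, the states whose score in `v` falls too far below the current best.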

5- The Language Model:


The language model is intended to find the most likely word sequences, in other words those that maximize the value P(M) of equation 1. Referring to the highest level of HMM in the acoustic model (see previous figure), the values P(M) are the probabilities of successive words.

a) Functioning of a language model

Writing M = m1...mn, where mi is the word of rank i in the sequence M, the probability P(M) decomposes as:

P(M) = P(m1) x P(m2 | m1) x ... x P(mn | m1...mn-1)


The evaluation of P(M) then reduces to computing the values P(mi) and P(mi | m1...mi-1), which are obtained using the equalities:

P(mi) = C(mi) / Sum over w in V of C(w)        P(mi | m1...mi-1) = C(m1...mi) / C(m1...mi-1)

where V is the vocabulary used by the ASR system, and C(mi) and C(m1...mi) denote the respective numbers of occurrences of the word mi and of the word sequence m1...mi in the training corpus. Unfortunately, as the word sequence m1...mi grows, the number of parameters P(mi) and P(mi | m1...mi-1) of the language model to estimate increases exponentially with n. To reduce this number, P(mi | m1...mi-1) is modeled by an N-gram, that is to say a Markov chain of order N-1 (with N > 1), using the approximation:

P(mi | m1...mi-1) = P(mi | mi-N+1...mi-1)

This equation indicates that every word mi can be predicted from the N-1 preceding words. For N = 2, 3 or 4 one speaks respectively of a bigram, trigram or quadrigram model. For N = 1, the model is called a unigram and reduces to estimating P(mi). Generally, it is bigram, trigram and quadrigram models that are used in the language models of ASR systems.
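The maximum-likelihood estimates above can be sketched concretely for the bigram case (N = 2). The toy corpus below is invented for illustration; a real language model would be trained on millions of words and smoothed to handle unseen bigrams.

```java
import java.util.*;

// Sketch of bigram estimation: P(word | prev) = C(prev word) / C(prev),
// with counts taken from a tiny toy training corpus.
public class BigramLM {
    Map<String, Integer> uni = new HashMap<>(); // unigram counts C(m)
    Map<String, Integer> bi = new HashMap<>();  // bigram counts C(m m')

    void train(String[] words) {
        for (int i = 0; i < words.length; i++) {
            uni.merge(words[i], 1, Integer::sum);
            if (i > 0) bi.merge(words[i - 1] + " " + words[i], 1, Integer::sum);
        }
    }

    double prob(String prev, String word) { // P(word | prev)
        int c = bi.getOrDefault(prev + " " + word, 0);
        return uni.containsKey(prev) ? (double) c / uni.get(prev) : 0.0;
    }

    public static void main(String[] args) {
        BigramLM lm = new BigramLM();
        lm.train("open the file close the file open the door".split(" "));
        // C(the file) = 2, C(the) = 3, so P(file | the) = 2/3
        System.out.println(lm.prob("the", "file"));
    }
}
```

With N = 3 or 4, the same counting scheme applies to longer histories, which is exactly why the number of parameters grows so quickly with N.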

6- The choice of the Sphinx API:


Sphinx-4 is a speech recognizer written entirely in Java. Its goals are to provide a highly flexible speech recognizer equal to commercial products, developed collaboratively by research centers from various universities, the laboratories of Sun and HP, and MIT. While being highly configurable, Sphinx-4 recognizes isolated words as well as phrases (through the use of grammars). Its architecture is scalable, enabling new research and the testing of new algorithms. The recognition quality depends directly on the quality of the voice data, that is, the information characterizing the voices themselves: examples are the different phonemes, the individual words (vocabulary) and the different ways of pronouncing them. The more such information the system has, the better its reaction and its choices. As shown in the following figure, which represents its architecture, Sphinx-4 is based on three modules.

Figure 24: General architecture of the Sphinx-4

6.1- The architecture of Sphinx-4:

Figure 25: Detailed Architecture of Sphinx-4

The main blocks are the FrontEnd, the decoder and the linguist; the support blocks include the configuration manager and the tool blocks. The FrontEnd takes one or more input signals and parameterizes them into a sequence of features. The linguist translates any kind of standard language model, together with pronunciation information from the dictionary and structural information from one or more sets of acoustic models, into a search graph. The SearchManager in the decoder uses the features from the FrontEnd and the search graph from the linguist to perform the actual decoding, generating results. At any time before or during the recognition process, the application can issue controls to each module, becoming a partner in the recognition process.

a) The Frontend
The FrontEnd cuts the recorded voice into different parts and prepares them for the decoder. Its aim is to parameterize an input signal (for example, audio) into a sequence of output features. As illustrated in Figure 26, the FrontEnd comprises one or more parallel chains of replaceable, communicating signal-processing modules called "DataProcessors". Supporting multiple chains allows the simultaneous computation of different types of parameters from the same or different input signals. This permits the creation of systems that can simultaneously decode using different parameter types, even types derived from non-speech signals.

Figure 26: Parallel chains of communicating Data Process

b) The Linguist :
The linguist generates the SearchGraph that is used by the decoder during the search, while hiding the complexity of generating this graph. As elsewhere in Sphinx-4, the linguist is a pluggable module, allowing people to dynamically configure the system with different linguist implementations. A typical implementation constructs the SearchGraph using the structure of the language represented by a given LanguageModel and the topological structure of the AcousticModel (HMMs for the basic sound units used by the system). During the generation of the SearchGraph, the linguist may also incorporate sub-word units with contexts of arbitrary length. By allowing different implementations of the linguist to be plugged in at run time, Sphinx-4 lets individuals provide different configurations for different systems and recognition tasks. For example, a simple digit-recognition application may use a simple linguist that keeps the search space entirely in memory. The linguist is based on three components, which are described in the following sections:

The language model
The dictionary
The acoustic model

b.1) The language model

Role: describes what can be said in a given context, and helps narrow the search space.

There are three kinds of language model: the simplest is used for isolated words, the second for command-and-control applications and the last for natural language. The LanguageModel implementation supports several types of grammars; we opted for JSGFGrammar, which supports the Java Speech API Grammar Format (JSGF) [20], a BNF-style, platform-independent and vendor-independent Unicode representation of grammars.

b.2) The dictionary

The dictionary gives the pronunciation of the words found in the LanguageModel. The pronunciations break words into sequences of sub-word units found in the AcousticModel. The Dictionary interface also supports word classification, allowing a single word to belong to several classes.

b.3) The AcousticModel

The AcousticModel module provides a correspondence between a unit of speech and an HMM that can be scored against the incoming features provided by the FrontEnd.

b.4) The SearchGraph

The SearchGraph is the main data structure used during the decoding process. It is a directed graph in which each node, called a SearchState, represents either an emitting or a non-emitting state. Emitting states can be scored against incoming acoustic features, while non-emitting states are generally used to represent higher-level linguistic constructs such as words and phonemes, which are not scored directly against incoming features. The arcs between states represent possible state transitions, each with a probability representing the likelihood of transitioning along that arc.
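The structure just described can be sketched as a small data structure. The names below are illustrative only, not the actual Sphinx-4 classes: the point is simply that the graph distinguishes emitting from non-emitting states and stores a transition probability on each arc.

```java
import java.util.*;

// Toy sketch of a SearchGraph: a directed graph whose nodes are emitting
// or non-emitting search states and whose arcs carry log transition
// probabilities.
public class ToySearchGraph {
    record State(String name, boolean emitting) {}
    record Arc(State to, double logProb) {}

    Map<State, List<Arc>> arcs = new HashMap<>();

    void addArc(State from, State to, double prob) {
        arcs.computeIfAbsent(from, k -> new ArrayList<>()).add(new Arc(to, Math.log(prob)));
    }

    public static void main(String[] args) {
        ToySearchGraph g = new ToySearchGraph();
        State word = new State("word:open", false);  // higher-level construct
        State ph = new State("phone:OW-1", true);    // scored against features
        g.addArc(word, ph, 1.0);
        System.out.println(g.arcs.get(word).size()); // prints 1
    }
}
```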

How the SearchGraph is built affects the memory footprint, the speed and the accuracy of recognition. The modular design of Sphinx-4, however, allows different SearchGraph compilation strategies to be used without changing other aspects of the system. The choice between static and dynamic construction of the language HMMs depends mainly on the size of the vocabulary, the complexity of the language model and the desired memory footprint of the system, and can be made by the application.

c) The decoder

The decoder is the heart of Sphinx-4. It processes the information received from the FrontEnd, analyzes it and compares it with the knowledge base to deliver a result to the application. The main role of the Sphinx-4 decoder block is to use the features from the FrontEnd, in collaboration with the SearchGraph from the linguist, to generate result hypotheses. The decoder block comprises a pluggable SearchManager and other supporting code that simplifies the decoding process for an application. As such, the most interesting element of the decoder block is the SearchManager. The decoder simply tells the SearchManager to recognize a set of feature frames. At each step of the process, the SearchManager creates a Result object that contains all the paths that have not yet reached a final non-emitting state.
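The frame-by-frame behavior of a SearchManager can be illustrated with a toy beam search. This is not the real Sphinx-4 API, only a sketch of the principle: at each frame, every active hypothesis is extended and scored, then a beam keeps only the most promising paths, exactly the kind of pruning mentioned earlier.

```java
import java.util.*;

// Toy sketch of a SearchManager-style decoding step: extend every active
// hypothesis with every possible unit, score it, and prune with a beam.
public class ToyBeamSearch {
    record Hyp(String path, double logScore) {}

    static List<Hyp> step(List<Hyp> active, Map<String, Double> unitLogProb, int beam) {
        List<Hyp> next = new ArrayList<>();
        for (Hyp h : active)
            for (var e : unitLogProb.entrySet())
                next.add(new Hyp(h.path() + e.getKey(), h.logScore() + e.getValue()));
        next.sort((a, b) -> Double.compare(b.logScore(), a.logScore())); // best first
        return next.subList(0, Math.min(beam, next.size()));
    }

    public static void main(String[] args) {
        Map<String, Double> units = Map.of("a", Math.log(0.6), "b", Math.log(0.4));
        List<Hyp> active = List.of(new Hyp("", 0.0));
        for (int frame = 0; frame < 3; frame++)
            active = step(active, units, 2); // beam of 2 surviving paths
        System.out.println(active.get(0).path()); // prints "aaa"
    }
}
```

In Sphinx-4 itself, the surviving paths at each step are what the SearchManager packages into the Result object described above.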

7- Technical use case diagram


This section presents the overall operation of the system and the various actions of the actors. Studying the needs of the actors who interact with our system requires developing the following use case diagram:


Figure 27: Technical use case diagram

The different use cases are:
Listen for instructions: capture the signal from the microphone.
Save the speech signal: store the signals coming from the microphone.
Analyze the speech signal: segment the signal into phonemes.
Match the speech signal: match the signal against the database.
Match the feature vector: match the characteristics of the analyzed signal against the database.
Extract the feature vector: analyze the signal by extracting its significant characteristics.
Classify the feature vector: classify the analyzed signals by category.
The different actors are:
The user
The dictionary (codebook)

II- The Generic Design

The generic design defines the components needed to build the technical architecture. This design is completely independent of the functional aspects; it aims to standardize and reuse the same mechanisms for all systems. The technical architecture forms the backbone of the system, and its importance is such that it is advisable to build a prototype of it.

Figure 28: The generic design

Software layers
Sphinx-4 has been compiled and tested on Solaris, Mac OS X, Linux and Windows. The execution, compilation and testing of Sphinx-4 require additional software; the following must be installed on the machine:
- Java SDK 5.1. http://java.sun.com.
- The various libraries that make up Sphinx-4.

Exploitation and Configuration Software :

a) Implementation of the library with Eclipse


Integrating Sphinx-4 into an arbitrary application is relatively easy. The first step is to create a new project (menu File - New - Project). The figure below shows how to create a new project in Eclipse.

Figure 29: Creation of a new project

The second step is to add the Sphinx-4 libraries to the project. To do this, right-click on the project and open the project properties, then choose the "Java Build Path" menu. Finally, click "Add External JARs" to add the various libraries provided by Sphinx. The libraries to add are the following:

Figure 30: Inserting the Sphinx-4 libraries into the project

js.jar.
jsapi.jar (this must be created by launching the application jsapi.exe located in the lib directory of the downloaded archive). This library is used by Java, among other things, to record sound.

sphinx4.jar.
TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar (only for recognition of digits).
WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz.
WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.

b) Writing the grammar

To perform recognition, we must write a grammar, that is to say a file describing the terms that must be recognized by the program. The grammars used by Sphinx are in the JSGF format (Java Speech Grammar Format). We must therefore create a file with the extension ".gram". This file contains the grammar used by the application, that is to say the words or phrases that can potentially be pronounced.

b.1) Example of a grammar

Figure 31: Grammar file
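Since the grammar file of Figure 31 appears only as an image, a minimal JSGF grammar of the kind described might look as follows. The command and application names here are illustrative assumptions, not the actual content of the project's grammar file:

```
#JSGF V1.0;

grammar commands;

public <command> = (open | close) (skype | firefox | mysql);
```

Each alternative expansion of the public rule corresponds to one sentence the recognizer can accept.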

The grammar file above allows the recognition of all the following sentences:


Figure 32: List of sentences that can be pronounced

The following figure shows the grammar above in graphical form.

Figure 33: Graphical grammar structure

c) Writing the Sphinx configuration file

After writing the grammar file, we must create the configuration file Filename.config.xml. The easiest way is to adapt a configuration file from one of the demonstrations provided in the downloaded archive. This file specifies, among other things, the dictionary and the grammar to use.


Figure 34: XML configuration file
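The configuration file of Figure 34 is shown only as an image; an excerpt of such a file, in the style of the Sphinx-4 demos, might look like the following. This is an illustrative sketch: component class names and property values vary between Sphinx-4 versions and applications, and the paths here are assumptions.

```xml
<config>
    <!-- Illustrative excerpt only: exact component types and paths
         depend on the Sphinx-4 version and the application. -->
    <component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
        <property name="dictionary" value="dictionary"/>
        <property name="grammarLocation" value="file:./grammar/"/>
        <property name="grammarName" value="commands"/>
    </component>
    <component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
        <property name="dictionaryPath"
                  value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>
    </component>
</config>
```

Each named component can then be looked up by the application through the configuration manager, which is what ties the grammar and dictionary choices to the linguist and decoder.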


Chapter VI THE MIDDLE PART

I- The design part:

Figure 35: The design part

The design model organizes the system into components that deliver technical and functional services. This model combines the information from the right branch and the left branch of the development cycle, and can be seen as the transformation of the analysis model obtained by projecting the analysis classes onto the software layers. The preliminary design is a delicate step because it integrates the functional analysis model into the technical architecture in order to map out the system components to be developed. The detailed design then examines how to realize each component. The coding step produces the components and tests the code units as they are completed. The acceptance step finally validates the functionality of the developed system.

1- Detailed Design:


Figure 36: The detailed class diagram

The detailed class diagram derives from the general class diagram (described in part 2.1, "The class diagram").
NB: note that some classes are transformed as follows:
Class Codebook: becomes our dictionary (database).
Class Instruction: becomes the grammar file.
Class LanguageInstruction: becomes the grammar file.

II- Realization part

1- Description of the application's interfaces:


In this part of the project we show a first example of our application. This interface is the home interface for all users.

Figure 37: The home interface

The next interface shows the process of adding a new application.


Figure 38: Adding a new application

This interface shows how to edit an existing application.


Conclusion

This project has led to the creation of an application for vocally manipulating other applications, namely MySQL and Skype. A search on the Internet and a careful study of the working tools were carried out in order to choose the most appropriate architecture for the system. Throughout this project we did our best to improve our application, but we faced a major problem: the development of an acoustic model customized to each user of our application. Concretely, what differentiates the applications present on the market (Dragon NaturallySpeaking, SpeakQ, etc.) is the degree of refinement of the acoustic model; this may be considered the most important task, as it requires additional time beyond the deadline of our project.

