
VOICE RESPONSE SYSTEM ABSTRACT

A voice response system (VRS) is a computer system that responds to voice commands rather than to input from a keyboard or mouse. Uses for this kind of system range from convenience to necessity to security. People who are visually or otherwise physically impaired are prime candidates for a voice response system: because they cannot see or otherwise access a keyboard or mouse, they have no way to use a computer without one, unless they depend entirely on other people. Being able literally to tell a computer what to do can be a revelation for someone who ordinarily has little hope of controlling a computer. A voice response system also comes in handy for someone who is not physically impaired. With a voice response system, you need not be very close to your computer in order to access it or give it commands; as long as you are in earshot of the PC, it can accept voice commands from you in the same way that it traditionally accepts keystroke and mouse commands. The system acquires speech at run time through a microphone and processes the sampled speech to recognize the uttered text. Sphinx-4, a speech recognition system written entirely in the Java(TM) programming language, is used for this purpose. A VRS is an intelligent system which enables the user to instruct the computer to perform actions through voice commands, and also to form his own repository of commands and map them to appropriate actions. The recognized text is then matched to the corresponding action.

CONTENTS
1. Introduction
2. Voice Recognition
   Relevance to the Project
   Applications of Voice Recognition
3. Working of the Project
   Speech Engine
   JSAPI
   JSAPI Classes and Interfaces
   Speech Synthesis
   Speech Recognition
   Speech Recognition Components
   Speech Recognition Weaknesses and Flaws
   Future of Speech Recognition
   JSGF Grammar Format
   Sphinx Speech Recognition System
4. Feasibility Study and Requirement Analysis
5. System Analysis and System Design
6. Data Flow Diagram
   Context Diagram
   Level 1
   Level 2
7. Code Snippets
8. Results and Screenshots
9. Discussion
10. Conclusion
11. Bibliography

Chapter 1 INTRODUCTION

A VRS is an intelligent system which enables the user to instruct the computer to perform actions through voice commands, and also to form his own repository of commands and map them to appropriate actions. A voice response system is a computer system that responds to voice commands rather than to input from a keyboard or mouse. Uses for this kind of system range from convenience to necessity to security. People who are visually or otherwise physically impaired are prime candidates for a voice response system: because they cannot see or otherwise access a keyboard or mouse, they have no way to use a computer without one, unless they depend entirely on other people. Being able literally to tell a computer what to do can be a revelation for someone who ordinarily has little hope of controlling a computer. A voice response system also comes in handy for someone who is not physically impaired. With a voice response system, you need not be very close to your computer in order to access it or give it commands; as long as you are in earshot of the PC, it can accept voice commands from you in the same way that it traditionally accepts keystroke and mouse commands.

Key points that outline the implemented idea:
1. The VRS runs as a background process.
2. Based on each instruction, independent processes are created; while the background process keeps listening for further commands, new processes are continuously launched in response to the input voice instructions.
3. Voice recognition could also be enabled in the launched processes, but this has been avoided because it interferes with the background recognizer.

VRS Library has been built which includes some basic commands.

1. DATA FILE - Opens a list of saved files that may be read.
2. SONGS - Opens a list of songs that may be played.
3. MOVIES - Opens a list of movies that may be played.
4. NEWS - Reads the news from a given website.
5. SNAP - Opens a picture.

The library may be further extended by the user for his own specific requirements. User.gram has been included in the src directory, along with directions for adding an action map for this purpose, as sketched below. Technologies used in the implementation: Sphinx-4, JSAPI, the Java programming language, and JSGF grammar files.
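For illustration only, an extension might pair a new JSGF rule in User.gram with a branch in the launcher's launchTask method; the "weather" command and the path below are hypothetical, not part of the shipped library:

// hypothetical addition to User.gram:
//     public <userCommand> = weather;
// and the matching branch in TaskLauncher1.launchTask:
else if (task.contains("weather")) {
    // launch whichever program the user mapped to this command (hypothetical path)
    Runtime.getRuntime().exec("C:\\apps\\weather.exe");
}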

The relevance and use of each of the above is discussed later in this document. The code has been developed in Eclipse. The paths used in mapping actions are absolute and hence system-dependent.

The requirement of this project is to develop an intelligent system which:
1. is capable of taking voice input;
2. interprets the input command;
3. processes the command to map it to the action set;
4. has an action set containing the mapping of each input to its corresponding response;
5. has an adaptive mechanism to handle more mappings and add them to the action set.
Example: the voice input "draw circle" draws a circle on the screen.

Chapter 2

VOICE RECOGNITION
The term voice recognition is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software.
1. Voice recognition converts speech to text.
2. Recognizing the speaker can simplify the task of translating speech.
3. Speaker-independent voice recognition aims to generalize the task without being targeted at a single speaker.

Relevance To The Project


1. Voice recognition is used to map a voice command to its corresponding action. This is brought about by converting speech to text.
2. The API used for recognizing voice is trained (by default) to understand an American male accent recorded at 16 kbps.
3. The program matches the input voice against the voice data on which it is trained and maps it to the best possible result.

Although the idea of recognizing voice may seem fairly simple, there are many real-time problems. Some of them:
1. A large amount of memory is required to store voice files.
2. Noise interference reduces accuracy.
3. Comparing the user's accent with the trained voice often gives rise to absurd results.
4. The precision of the system is directly proportional to the complexity of the source code.

APPLICATIONS OF VOICE RECOGNITION


Health Care
In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete; the services they provide may be redistributed rather than replaced. Speech recognition is also used to help deaf people understand the spoken word via speech-to-text conversion.

Military
Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program in France on installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays. Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system.

Telephony and other domains


ASR in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. Despite the high level of integration with word processing in general personal computing, however, ASR in the field of document production has not seen the expected increases in use. The improvement of mobile processor speeds made the speech-enabled Symbian and Windows Mobile smartphones feasible. Speech is used mostly as part of the user interface, for creating pre-defined or custom speech commands. Leading software vendors in this field are Microsoft Corporation (Microsoft Voice Command), Nuance Communications (Nuance Voice Control), Vito Technology (VITO Voice2Go), Speereo Software (Speereo Voice Translator) and SVOX.

People with disabilities


People with disabilities can benefit from speech recognition programs. Speech recognition is especially useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to involved disabilities that preclude using conventional computer input devices. In fact, people who used the keyboard a lot and developed RSI became an urgent early market for speech recognition. Speech recognition is used in deaf telephony, such as voicemail to text, relay services, and captioned telephone. Individuals with learning disabilities who have problems with thought-to-paper communication (essentially they think of an idea but it is processed incorrectly causing it to end up differently on paper) can benefit from the software.

Home Automation
With convenience as the priority, such a program also finds application in home automation. Home automation may include centralized control of lighting, heating, ventilation, air conditioning and other systems, to provide improved convenience, comfort, energy efficiency and security.

Transcription
Transcription in the linguistic sense is the conversion of a representation of language into another representation of language, usually in the same language but in a different form. Transcription should not be confused with translation, which in linguistics usually means converting from one language to another, such as from English to Spanish. The most common type of transcription is from a spoken-language source into text.

Chapter 3 WORKING OF THE PROJECT


In this chapter, we will cover all the elements required for the working of the project and then converge the requirements to explain the solution design implemented for the project.

Speech Engine
The Speech Engine loads a list of words to be recognized; this list of words is called a grammar. It takes as input distinct characteristics of the sound, derived from the waveform, and compares them with its own acoustic model. The engine searches its acoustic space, using the grammar to guide the search, determines which words in the grammar the audio most closely matches, and returns a result.

[Figure: Speech Engine]

Java Speech API/JSAPI


The Java Speech API (JSAPI) is an application programming interface for cross-platform support of command-and-control recognizers, dictation systems, and speech synthesizers. Although JSAPI defines only an interface, there are several implementations created by third parties, for example FreeTTS. The Java Speech API enables speech applications to interact with speech engines in a common, standardized, and implementation-independent manner. Speech engines from different vendors can be accessed using the Java Speech API, as long as they are JSAPI-compliant. With JSAPI, speech applications can use speech engine functionality such as selecting a specific language or voice, as well as any required audio resources. JSAPI provides an API for both speech synthesis and speech recognition.

The Java Speech APIs classes and interfaces

The different classes and interfaces that form the Java Speech API are grouped into the following three packages:

javax.speech: contains classes and interfaces for a generic speech engine.
javax.speech.synthesis: contains classes and interfaces for speech synthesis.
javax.speech.recognition: contains classes and interfaces for speech recognition.

The Central class is like a factory class that all Java Speech API applications use. It provides static methods to enable the access of speech synthesis and speech recognition engines. The Engine interface encapsulates the generic operations that a Java Speech API-compliant speech engine should provide for speech applications.

Speech applications can primarily use its methods to perform actions such as retrieving the properties and state of the speech engine and allocating and deallocating resources for a speech engine. In addition, the Engine interface exposes mechanisms to pause and resume the audio stream generated or processed by the speech engine. The Engine interface is subclassed by the Synthesizer and Recognizer interfaces, which define additional speech synthesis and speech recognition functionality. The Synthesizer interface encapsulates the operations that a Java Speech API-compliant speech synthesis engine should provide for speech applications.
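A minimal sketch of this synthesis API, assuming a JSAPI-compliant engine such as FreeTTS is installed and registered with the Central factory:

import java.util.Locale;
import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class SynthesisSketch {
    public static void main(String[] args) throws Exception {
        // Ask the Central factory for any JSAPI-compliant English synthesizer.
        Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.ENGLISH));
        synth.allocate();                                   // acquire engine resources
        synth.resume();                                     // leave the PAUSED state
        synth.speakPlainText("Voice response system ready.", null);
        synth.waitEngineState(Synthesizer.QUEUE_EMPTY);     // block until everything is spoken
        synth.deallocate();                                 // release engine resources
    }
}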

The Java Speech API is based on the event-handling model of AWT components. Events generated by the speech engine can be identified and handled as required. There are two ways to handle speech engine events: through the EngineListener interface or through the EngineAdapter class.
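Continuing the sketch above, an application can subclass EngineAdapter and override only the events it cares about (listener method names as defined in javax.speech):

import javax.speech.EngineAdapter;
import javax.speech.EngineEvent;

// Attach a listener to the synthesizer created earlier; only the
// overridden callbacks fire, the EngineAdapter defaults do nothing.
synth.addEngineListener(new EngineAdapter() {
    public void engineAllocated(EngineEvent e) {
        System.out.println("Engine ready: " + e);
    }
    public void enginePaused(EngineEvent e) {
        System.out.println("Audio paused: " + e);
    }
});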

[Figure: JSAPI stack]

Features:
1. Converts speech to text.
2. Converts text and delivers it in various formats of speech.
3. Supports events based on the Java event queue.
4. Easy-to-implement API that interoperates with multiple Java-based applications such as applets and Swing applications.
5. Interacts seamlessly with the AWT event queue.
6. Supports annotations using JSML to improve pronunciation and naturalness in speech.
7. Supports grammar definitions using JSGF.
8. Ability to adapt to the language of the speaker.

Two core speech technologies are supported through the Java Speech API: speech synthesis and speech recognition.

Speech synthesis
Speech synthesis provides the reverse process of producing synthetic speech from text generated by an application, an applet, or a user. It is often referred to as text-to-speech technology. The major steps in producing speech from text are as follows:

Structure analysis: Processes the input text to determine where paragraphs, sentences, and other structures start and end. For most languages, punctuation and formatting data are used in this stage.

Text pre-processing: Analyzes the input text for special constructs of the language. In English, special treatment is required for abbreviations, acronyms, dates, times, numbers, currency amounts, e-mail addresses, and many other forms. Other languages need special processing for these forms, and most languages have other specialized requirements.

The remaining steps convert the spoken text to speech:

Text-to-phoneme conversion: Converts each word to phonemes. A phoneme is a basic unit of sound in a language.

Prosody analysis: Processes the sentence structure, words, and phonemes to determine the appropriate prosody for the sentence.

Waveform production: Uses the phonemes and prosody information to produce the audio waveform for each sentence. Speech synthesizers can make errors in any of the processing steps described above.

Human ears are well-tuned to detecting these errors, but careful work by developers can minimize errors and improve the speech output quality. The Java Speech API and the Java Speech API Markup Language (JSML) provide many ways for you to improve the output quality of a speech synthesizer.
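In this project, synthesis is done with FreeTTS (see the voice1 helper in Chapter 7); a minimal self-contained version of that pattern, using the kevin16 voice bundled with FreeTTS, looks like this:

import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class SpeakSketch {
    public static void main(String[] args) {
        // "kevin16" is the 16 kHz diphone voice shipped with FreeTTS.
        Voice voice = VoiceManager.getInstance().getVoice("kevin16");
        if (voice == null) {
            System.err.println("kevin16 voice not found");
            return;
        }
        voice.allocate();       // load the synthesis resources
        voice.speak("Hello and welcome to the voice response system.");
        voice.deallocate();     // free them again
    }
}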

Speech Recognition
Speech recognition provides computers with the ability to listen to spoken language and determine what has been said. In other words, it processes audio input containing speech by converting it to text.

Speech Recognition System Components:


With the help of a microphone, audio is input to the system; the PC sound card produces the equivalent digital representation of the received audio.

Digitization
The process of converting the analog signal into a digital form is known as digitization; it involves both sampling and quantization. Sampling converts a continuous signal into a discrete signal, while the process of approximating a continuous range of values by a finite set of levels is known as quantization.
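As a concrete illustration in Java (using the standard javax.sound.sampled API, not code from this project), the digitization parameters that Sphinx-style front ends commonly expect from the microphone can be expressed as:

import javax.sound.sampled.AudioFormat;

// 16,000 samples per second (sampling), 16-bit values (quantization),
// mono, signed PCM, little-endian byte order.
AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);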

[Figure: Speech recognition system]

Acoustic Model
An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech. The acoustic model software breaks the words into phonemes.

Language Model
Language modeling is used in many natural language processing applications; in speech recognition, it tries to capture the properties of a language and to predict the next word in the speech sequence. The software language model compares the phonemes to words in its built-in dictionary.
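For example, an n-gram language model approximates the probability of a word sequence by conditioning each word on only the few words before it; for a trigram model (with padding symbols at the sentence start):

P(w_1,\dots,w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})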

Speech engine
The job of the speech recognition engine is to convert the input audio into text; to accomplish this it uses all sorts of data, software algorithms and statistics. Its first operation is digitization, as discussed earlier, converting the audio into a suitable format for further processing. Once the audio signal is in the proper format, the engine searches for the best match by considering the words it knows; once the signal is recognized, it returns the corresponding text string.

The major steps of a typical speech recognizer are as follows:

Grammar design: Defines the words that may be spoken by a user and the patterns in which they may be spoken.

Signal processing: Analyzes the spectrum (the frequency) characteristics of the incoming audio.

Phoneme recognition: Compares the spectrum patterns to the patterns of the phonemes of the language being recognized.

Word recognition: Compares the sequence of likely phonemes against the words and patterns of words specified by the active grammars.

Result generation: Provides the application with information about the words the recognizer has detected in the incoming audio. The result information is always provided once recognition of a single utterance (often a sentence) is complete, but may also be provided during the recognition process. The result always indicates the recognizer's best guess of what a user said, but may also indicate alternative guesses.

A grammar is an object in the Java Speech API that indicates what words a user is expected to say and in what patterns those words may occur. Grammars are important to speech recognizers because they constrain the recognition process. These constraints make recognition faster and more accurate because the recognizer does not have to check for bizarre sentences. The Java Speech API supports two basic grammar types: rule grammars and dictation grammars. These types differ in various ways, including how applications set up the grammars; the types of sentences they allow; how results are provided; the amount of computational resources required; and how they are used in application design. Rule grammars are defined by JSGF, the Java Speech Grammar Format.
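A sketch of how an application might load a JSGF rule grammar and receive results through the JSAPI event model follows; the grammar path is hypothetical, and engine availability depends on the installed JSAPI implementation:

import java.io.FileReader;
import java.util.Locale;
import javax.speech.Central;
import javax.speech.recognition.*;

public class RecognitionSketch {
    public static void main(String[] args) throws Exception {
        // Create any available English recognizer (no dictation support required).
        Recognizer rec = Central.createRecognizer(
                new RecognizerModeDesc(Locale.ENGLISH, Boolean.FALSE));
        rec.allocate();

        // Load and enable a JSGF rule grammar (hypothetical file name).
        RuleGrammar grammar = rec.loadJSGF(new FileReader("user.gram"));
        grammar.setEnabled(true);

        // Print the best guess whenever an utterance is accepted.
        rec.addResultListener(new ResultAdapter() {
            public void resultAccepted(ResultEvent e) {
                Result r = (Result) e.getSource();
                for (ResultToken t : r.getBestTokens()) {
                    System.out.print(t.getSpokenText() + " ");
                }
                System.out.println();
            }
        });

        rec.commitChanges();   // apply the grammar change
        rec.requestFocus();    // ask for recognition focus
        rec.resume();          // start listening
    }
}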

[Figure: Speech recognition workflow]

Speech Recognition Weaknesses and Flaws


Despite all these advantages and benefits, a hundred-percent-perfect speech recognition system has yet to be developed; a number of factors can reduce the accuracy and performance of a speech recognition program. Speech recognition is easy for a human but a difficult task for a machine; compared with a human mind, speech recognition programs seem less intelligent. This is due to the fact that thinking, understanding and reacting are natural for a human mind, while for a computer program it is a complicated task: it first needs to understand the spoken words with respect to their meanings, and it has to strike a sufficient balance between the words, noise and pauses. A human has a built-in capability of filtering noise from speech, while a machine requires training; a computer needs help to separate the speech sound from other sounds.

A few factors that are considerable in this regard are:

Homonyms: words that are spelled differently and have different meanings but sound the same, for example "there" and "their", "be" and "bee". It is a challenge for a machine to distinguish between such phrases that sound alike.

Overlapping speech: a second challenge in the process is to understand speech uttered by different users; current systems have difficulty separating the simultaneous speech of multiple users.

Noise factor: the program requires the words uttered by a human to be heard distinctly and clearly. Any extra sound can create interference: the system should first be placed away from noisy environments, and the user should then speak clearly, else the machine will get confused and mix up the words.

The Future Of Speech Recognition:


Accuracy will become better and better. Dictation speech recognition will gradually become accepted. Greater use will be made of intelligent systems which will attempt to guess what the speaker intended to say, rather than what was actually said, as people often misspeak and make unintentional mistakes.

Microphone and sound systems will be designed to adapt more quickly to changing background noise levels, different environments, with better recognition of extraneous material to be discarded.

JSGF Grammar Format


Speech recognition systems provide computers with the ability to listen to user speech and determine what is said. Current technology does not yet support unconstrained speech recognition: the ability to listen to any speech in any context and transcribe it accurately. To achieve reasonable recognition accuracy and response time, current speech recognizers constrain what they listen for by using grammars.

The Java Speech Grammar Format (JSGF) defines a platform-independent, vendor-independent way of describing one type of grammar, a rule grammar (also known as a command and control grammar or regular grammar). It uses a textual representation that is readable and editable by both developers and computers, and can be included in Java source code. The other major grammar type, the dictation grammar, is not discussed in this document.

A rule grammar specifies the types of utterances a user might say (a spoken utterance is similar to a written sentence). For example, a simple window control grammar might listen for "open a file", "close the window", and similar commands.

What the user can say depends upon the context: is the user controlling an email application, reading a credit card number, or selecting a font? Applications know the context, so applications are responsible for providing a speech recognizer with appropriate grammars.

This document is the specification for the Java Speech Grammar Format. First, the basic naming and structural mechanisms are described. Following that, the basic components of the grammar, the grammar header and the grammar body, are described. The grammar header declares the grammar name and lists the imported rules and grammars. The grammar body defines the rules of this grammar as combinations of speakable text and references to other rules. Finally, some simple examples of grammar declarations are provided.

Grammars are used by speech recognizers to determine what the recognizer should listen for, and so describe the utterances a user may say. A Java Speech Grammar Format document starts with a self-identifying header that indicates the version of JSGF being used (currently V1.0):

#JSGF V1.0;

The grammar body defines rules. Each rule is defined in a rule definition, and a rule is defined only once in a grammar. The order of definition of rules is not significant. A rule definition takes one of the forms:

<ruleName> = ruleExpansion;
public <ruleName> = ruleExpansion;
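Putting these pieces together, a small command-and-control grammar in JSGF might read as follows (the grammar name and rules are illustrative, not the project's actual User.gram):

#JSGF V1.0;

grammar com.example.media;

// Any utterance matching <command> may be spoken by the user.
public <command> = <action> [the] <object>;

<action> = open | play | read | show;
<object> = movie | song | news | "data file" | snap;

Square brackets mark optional words, the vertical bar separates alternatives, and a quoted token such as "data file" groups words that must be spoken together.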

Sphinx Speech Recognition System


Sphinx-4 is a speech recognition system written entirely in the Java(TM) programming language. Sphinx is a continuous-speech, speaker-independent, large-vocabulary recognition system making use of hidden Markov model acoustic models (HMMs) and an n-gram statistical language model.

Each component of the architecture is explained below:

Recognizer- Contains the main components of Sphinx-4, which are the front end, the linguist, and the decoder. The application interacts with the Sphinx-4 system mainly via the Recognizer.

Audio - The data to be decoded. This is audio in most systems, but it can also be configured to accept other forms of data, e.g., spectral or cepstral data.

Front End- Performs digital signal processing (DSP) on the incoming data.

Feature- The outputs of the front end are features, which are used for decoding in the rest of the system.

Linguist- Embodies the linguistic knowledge of the system, which are the acoustic model, the dictionary, and the language model. The linguist produces a search graph structure on which the search manager performs search using different algorithms.

[Figure: Sphinx-4 architecture]

Acoustic Model- Contains a representation (often statistical) of a sound, often created by training on large amounts of acoustic data.

Dictionary- Responsible for determining how a word is pronounced.

Language Model- Contains a representation (often statistical) of the probability of occurrence of words.

Search Graph- The graph structure produced by the linguist according to certain criteria (e.g., the grammar), using knowledge from the dictionary, the acoustic model, and the language model.

Decoder- Contains the search manager.

Search Manager- Performs the search using a particular algorithm, e.g., breadth-first search, best-first search, depth-first search, etc. Also contains the feature scorer and the pruner.

Active List- A list of tokens representing all the states in the search graph that are active in the current feature frame.

Scorer- Scores the current feature frame against all the active states in the Active List.

Pruner- Prunes the active list according to certain strategies.

Result- The decoded result, which usually contains the N-best results.

Configuration Manager- Loads the Sphinx-4 configuration data from an XML-based file and manages the component life cycle for objects.
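For example, the ConfigurationManager wires these components together from an XML file; a fragment in the style of the standard Sphinx-4 demo configurations (component names here are illustrative, not the project's actual vrs.config.xml) looks like this:

<config>
    <!-- the recognizer delegates to a decoder, which holds the search manager -->
    <component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
        <property name="decoder" value="decoder"/>
    </component>
    <component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
        <property name="searchManager" value="searchManager"/>
    </component>
</config>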

The need for Sphinx 4:


1. Need to overcome Sphinx-3's limitations.
2. Need for flexibility in acoustic modeling.
3. Requires handling of multimodal inputs, with information fusion at various levels.
4. Need for more accurate decoders.
5. Need for expansion of language-model capabilities.
6. Facilitates the incorporation of several new online algorithms that are currently difficult to incorporate into Sphinx-3.
7. Need for better application interfaces.

The SPHINX of the new millennium:


1. An open-source project by Carnegie Mellon University, Sun Microsystems Inc. and MERL.
2. Written entirely in Java(TM).
3. Highly modularized and flexible architecture.
4. Supports any acoustic model structure.
5. Supports most types of language models: CFGs, N-grams, combinations.
6. New algorithms for obtaining word-level hypotheses.
7. Multimodal inputs.
8. Flexible APIs.

Recognition Issue:

Good Voice Data is the key to good recognition!


The quality of recognition is directly related to the quality of the voice data. As part of the Sphinx-4 project, a trainer will be developed to provide good voice data.

How does a Recognizer Work?

Goal:
Audio goes in; results come out.

Three application types


1. Isolated words
2. Command / Control
3. General Dictation

Front-End:
Transforms the speech waveform into features used by recognition. Features are sets of mel-frequency cepstrum coefficients (MFCCs); MFCCs model the human auditory system. The front end is a set of signal-processing filters with a pluggable architecture.
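The mel scale underlying MFCCs warps frequency f (in hertz) to match human pitch perception; a commonly used form of the mapping is:

m = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)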

Knowledge Base:
The data that drives the decoder. Consists of three sets of data: the dictionary, the acoustic model, and the language model. Needs to scale between the three application types.

DICTIONARY:
1. Maps words to pronunciations.
2. Provides word classification information (such as part of speech).
3. A single word may have multiple pronunciations.
4. Pronunciations are represented as phones or other units.
5. Can vary in size from a dozen words to more than 100,000 words.
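For instance, entries in the style of the CMU pronouncing dictionary used by the Sphinx family (stress markers omitted here) look like this, with "(2)" marking an alternate pronunciation:

HELLO        HH AH L OW
HELLO(2)     HH EH L OW
MOVIE        M UW V IY
NEWS         N UW Z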

Language Model:
1. Describes what is likely to be spoken in a particular context.
2. Uses a stochastic approach: word transitions are defined in terms of transition probabilities.
3. Helps to constrain the search space.
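A statistical language model is often stored in the ARPA n-gram format; a toy example with made-up log10 probabilities (first column) and backoff weights (last column on the unigram lines):

\data\
ngram 1=4
ngram 2=2

\1-grams:
-0.60 open  -0.30
-0.60 play  -0.30
-0.90 news  -0.30
-0.90 song  -0.30

\2-grams:
-0.40 open news
-0.50 play song

\end\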

Acoustic Models:
1. A database of statistical models.
2. Each statistical model represents a single unit of speech, such as a word or phoneme.
3. Acoustic models are created/trained by analyzing large corpora of labeled speech.
4. Acoustic models can be speaker-dependent or speaker-independent.

Chapter 4 FEASIBILITY STUDY, REQUIREMENT ANALYSIS


SOFTWARE DEVELOPMENT LIFE CYCLE
Since the inception of this project, all software engineering principles have been followed, and the project has passed through all the stages of the software development life cycle (SDLC). A development process consists of various phases, each phase ending with a defined output. The main reason for following the SDLC process is that it breaks the problem of developing software into a set of successively performed phases, each handling a different concern of software development. Object technologies lead to reuse, and reuse (of program components) leads to faster software development and higher-quality programs. Object-oriented software is easy to maintain because its structure is inherently decoupled. In addition, object-oriented systems are easier to adapt and easier to scale. The object-oriented process moves through an evolutionary spiral that starts with customer communication. It is here that the problem domain is defined and the basic problem classes are identified. Planning establishes a foundation for the object-oriented project plan.

FEASIBILITY STUDY
The project is feasible because speech recognition is already frequently used in various areas such as the military, telephony and healthcare. It is also used by leading industries for recognizing their employees in the attendance process. So the project is feasible and can be completed in the given period. A real-time voice recognition security system can be developed using different algorithms.

THREE PHASES OF FEASIBILITY STUDY

Technical Feasibility:


It involves determining whether or not a system can actually be constructed to solve the problem at hand. The technical issues raised during the feasibility stage of the investigation relate to the achievability of the project's goals and the possibility of completing the project.

Economical Feasibility:
This feasibility deals with the cost/benefit analysis. A number of intangible benefits like user friendliness, robustness and security were pointed out. The cost that will be incurred upon the implementation of this project would be quite nominal.

Operational Feasibility:
The developed system will be very reliable and user-friendly. All the features and operations that we will implement in our project are possible to implement and thus feasible; this will facilitate easy use and adoptability of the system. With the use of menus and proper validation, the system becomes fully understandable and operable for the common user.

STEPS INVOLVED IN THE FEASIBILITY ANALYSIS


The feasibility study is carried out in the following steps.

Form a project team and appoint a project leader:
First of all, the project management of the organization forms a separate team for each independent project. The team comprises one or more systems analysts and programmers, with a project leader. The project leader is responsible for planning and managing the development activities of the system.

Start the preliminary investigation:

The systems analyst of each project team starts a preliminary investigation using different fact-finding techniques.

Prepare the current system flowchart:

After the preliminary investigation, the analysts prepare the flowchart of the current system. These charts describe the general working of the system in a graphical way.

Determine the objectives of the proposed system:

The major objectives of the proposed system are listed by each analyst and discussed in the context of the current system.

Describe the deficiencies of the current system:

On studying the current system flowchart, the analysts identify the deficiencies of the system.

Prepare the proposed system flowchart:

After determining the major objectives of the proposed system, the analysts prepare their system flowchart. The flowcharts of the proposed system are compared with those of the current system.

Determine the technical feasibility:

The existing computer systems (hardware/software) of the concerned department are identified and their technical specifications are noted down. The analyst decides whether the existing systems are sufficient for the technical requirements of the proposed system.

Determine the operational feasibility:

After determining the economic feasibility, the analysts identify the responsible users of the system and hence determine the operational feasibility of the project.

Presentation of the feasibility analysis:

During the feasibility study, the analysts also keep working on the feasibility report. At the end, the feasibility analysis report is given to the management along with an oral presentation.

Feasibility analysis report:

The feasibility analysis report is a formal document for management use, prepared by the systems analyst during or after the feasibility study. The report generally contains the following sections.

Covering letter:
It formally presents the report, with a brief description of the project problem along with recommendations to be considered.

Table of contents:
It lists the sections of the feasibility study report along with their page numbers.

Description of the existing system:


A brief description of the existing system along with the purpose and scope of the project.

System requirements:
The system requirements, which are either derived from the existing system or from the discussion with the users, are presented in this section.

Description of proposed system:


It presents a general description of the proposed system, highlighting its role in solving the problem. A description of the output reports to be generated by the system is also presented, in the desired formats.

Development plan:
It presents a detailed plan with the starting and completion dates for the different phases of the SDLC. Complementary plans are also needed for hardware and software evaluation, purchase and installation.

Technical feasibility findings:

It presents the findings of the technical feasibility study along with recommendations.

Costs and benefits: The detailed findings of the cost and benefit analysis are presented in this section. The savings and benefits are highlighted to justify the economic feasibility of the project.

Operational feasibility findings:

It presents the findings of the operational feasibility study along with the human resource requirements to implement the system.

REQUIREMENT ANALYSIS
A requirement is a condition or capability that must be met or possessed by a system to satisfy a contract, standard, specification or other formally imposed specification of the client. This phase ends with the Software Requirements Specifications (SRS). The SRS is a document that completely describes what the proposed software should do without describing how the software will do it.

SOFTWARE REQUIREMENTS SPECIFICATIONS


System analysis is a technique for carrying out system requirements analysis and project management, using structured analysis to specify both manual and automated systems. In system analysis, the focus is on inquiry into the current organizational environment, defining the system requirements, making recommendations for system improvement, and determining the feasibility of the system.

Analysis Methodology:
A complete understanding of the requirements is essential for the success of a project. This is achieved by gathering information; sensitivity, common sense, and knowledge of what to gather, when to gather it, and how to use it are key to securing information. There are various tools for gathering information during the system analysis phase. The phases are:
1. Familiarity with the present system through available documentation, such as procedure manuals, documents and their flow, interviews of user staff, and on-site observation.
2. Definition of the decision-making associated with managing the system. This is important for determining what information is required of the system; conducting interviews clarifies the decision points and how decisions are made in the user area.
3. Once the decision points are identified, an investigation may be conducted to define the information requirements. The information gathered is analyzed and documented, and discrepancies between the decision system and the information gathered from the information system are identified. This concludes the analysis and sets the stage for system design.

Types of Information Needed:

Organization-based information deals with policies, objectives, goals and structure. User-based information focuses on the users' information requirements. Work-based information addresses the work flow, methods and procedures, and workstations. We are interested in what happens to the data at various points in the system.

SYSTEM REQUIREMENTS:

SOFTWARE REQUIREMENTS:
Language: Java SDK, Eclipse
Front-end tool: Sphinx-4
Back-end tool: Oracle 10g for the database
Operating system: Windows XP/7
Documentation: Microsoft Word

HARDWARE REQUIREMENTS:

Processor: PC with a Pentium IV-class processor, 600 MHz (recommended: Pentium IV-class, 1.63 GHz)
RAM: 1 GB
Hard disk space: 20 GB on the system drive, 10 GB for the development environment
Microphone: good-quality microphone

Chapter 5 SYSTEM ANALYSIS AND SYSTEM DESIGN


Requirement analysis defines WHAT the system should do; design tells HOW to do it. This is the simplest way to define system design. Any design has to be constantly evaluated to ensure that it meets its requirements and is practical and workable in the given environment. If there are a number of alternatives, all of them must be evaluated and the best possible solution implemented.

SYSTEM ANALYSIS
System analysis is a term used to describe the process of collecting and analyzing facts about the existing operations of the prevailing situation, so that an effective and accurate computerized system may be designed and implemented if found feasible. This is required in order to understand the problem that has to be solved. The problem may be of any kind, such as computerizing an existing system, developing an entirely new system, or a combination of the two. To solve the problem in the actual sense is not the aim of the designing phase; the aim is to see how the problem can be solved. For this, a logical model of the system is required, providing the way to solve the problem and achieve the desired goal. The logical view of the system is provided to the developer and the user for decision making, so that the developer can feel at ease in designing the system.

SPECIFICATION OF PROJECT
The proposed system should have the following features:
1. It should be able to store voices in .wav format.
2. It should be able to store usernames in a database.
3. It should provide options for existing and new users.
4. It should have the ability to process voice prints.
5. It should closely match the voices.
6. It should recognize speech up to a reasonable extent.
7. It should provide proper guidance to the user.
8. It should give fast results.

SYSTEM DESIGN
System design is the technique of creating a system that takes into account factors such as needs, performance levels, database design, hardware specifications, and data management. It is the most important part of the development of the system, as in the design phase the developer brings into existence the proposed system that the analyst thought of in the analysis phase.

DESIGN CONCEPT
Software design sits at the technical kernel of software engineering and is applied regardless of the software process model used. After software requirements have been analyzed and specified, software design is the first of three technical activities (design, code generation and test) required to build and verify the software. Each activity transforms information in a manner that ultimately results in validated computer software. The design transforms the information domain model created during analysis into the data structures that will be required to implement the software. The data objects, the relationship diagram and the detailed data content depicted in the data dictionary provide the basis for the design activity. As aforesaid, design is the phase of software engineering that decides the completion of a project or its complete failure. In our project we have spent the maximum time on preprocessing and processing of the input, so as to make it easier for the user to match the input against stored data; data flow diagrams for the project have also been developed. The training database structures are well defined, with a complete description of the data used. Another part which took most of our consideration is the user input: we decided to let the user directly give the path of the input files in the dialog box and then execute each of them. The architectural design defines the relationships between the major structural elements of the software, the design patterns that can be used to achieve the requirements defined for the system, and the constraints that affect the way in which architectural design patterns can be applied. The interface design describes how the software communicates within itself, with systems that interoperate with it, and with the humans who use it. An interface implies a flow of information and a specific type of behavior. Design is the phase where quality is fostered: design provides us with representations of software that can be assessed for quality, and it is the only way that we can accurately translate a customer's requirements into a finished software product or system. The design serves as the foundation for the software support steps that follow.

Chapter 6 DATA FLOW DIAGRAM


A data flow diagram is a graphical tool used to describe and analyze the movement of data through a system. DFDs are the central tool, and the other components are developed from them. The transformation of data from input to output, through processes, may be described logically and independently of the physical components associated with the system; such diagrams are known as logical data flow diagrams. Physical data flow diagrams, by contrast, show the actual implementation and movement of data between people, departments and workstations. A full description of a system actually consists of a set of data flow diagrams, developed using two familiar notations: Yourdon, and Gane and Sarson. Each component in a DFD is labelled with a descriptive name, and each process is further identified with a number used for identification purposes. DFDs are developed in several levels: each process in a lower-level diagram can be broken down into a more detailed DFD at the next level. The top-level diagram is often called the context diagram. It consists of a single process, which plays a vital role in studying the current system. The process in the context-level diagram is exploded into other processes at the first-level DFD. The idea behind the explosion of a process into more processes is that the understanding at one level of detail is expanded into greater detail at the next level; this is done until no further explosion is necessary and an adequate amount of detail is described for the analyst to understand the process. Larry Constantine first developed the DFD as a way of expressing system requirements in a graphical form, and this led to modular design. A DFD, also known as a bubble chart, has the purpose of clarifying system requirements and identifying the major transformations that will become programs in system design. So it is the starting point of the design, down to the lowest level of detail. A DFD consists of a series of bubbles joined by data flows in the system.

DFD SYMBOLS:
In the DFD, there are four symbols:
1. A square defines a source (originator) or destination of system data.
2. An arrow identifies data flow; it is the pipeline through which information flows.
3. A circle or bubble represents a process that transforms incoming data flows into outgoing data flows.
4. An open rectangle is a data store: data at rest, or a temporary repository of data.

[Figure: the four DFD symbols: a process that transforms data flow; a source or destination of data; a data flow; a data store]

CONSTRUCTING DFD:
Several rules of thumb are used in drawing DFDs:
1. Processes should be named and numbered for easy reference. Each name should be representative of the process.
2. The direction of flow is from top to bottom and from left to right. Data traditionally flow from the source to the destination, although they may flow back to the source. One way to indicate this is to draw a long flow line back to the source; an alternative is to repeat the source symbol as a destination. Since it is then used more than once in the DFD, it is marked with a short diagonal.
3. When a process is exploded into lower-level details, the sub-processes are numbered.
4. The names of data stores and destinations are written in capital letters. Process and data flow names have the first letter of each word capitalized.
5. A DFD typically shows the minimum contents of a data store; each data store should contain all the data elements that flow in and out. Missing interfaces, redundancies and the like are then accounted for, often through interviews.

SALIENT FEATURES OF THE DFD:
1. The DFD shows the flow of data, not of control; loops and decisions are control considerations and do not appear on a DFD.
2. The DFD does not indicate the time factor involved in any process, i.e., whether the data flows take place daily, weekly, monthly or yearly.
3. The sequence of events is not brought out on the DFD.

TYPES OF DATA FLOW DIAGRAMS:


1. Current physical
2. Current logical
3. New logical
4. New physical

CURRENT PHYSICAL:
In the current physical DFD, process labels include the names of people or their positions, or the names of the computer systems that might provide some of the overall system processing; the labels include an identification of the technology used to process the data. Similarly, data flows and data stores are often labelled with the names of the actual physical media on which data are stored, such as file folders, computer files, business forms or computer tapes.

CURRENT LOGICAL:
The physical aspects of the system are removed as much as possible, so that the current system is reduced to its essence: the data and the processes that transform them, regardless of their actual physical form.

NEW LOGICAL:
This is exactly like the current logical model if the user were completely happy with the functionality of the current system but had problems with how it was implemented. Typically, the new logical model will differ from the current logical model by having additional functions, obsolete functions removed, and inefficient flows reorganized.

NEW PHYSICAL:
The new physical represents only the physical implementation of the new system.

RULES GOVERNING THE DFDS


PROCESS:
1) No process can have only outputs.
2) No process can have only inputs; if an object has only inputs, it must be a sink.
3) A process has a verb-phrase label.

DATA STORE:
1) Data cannot move directly from one data store to another data store; a process must move the data from the source and place it into the data store.
2) A data store has a noun-phrase label.

SOURCE OR SINK:
The origin and/or destination of data.
1) Data cannot move directly from a source to a sink; it must be moved by a process.
2) A source and/or sink has a noun-phrase label.

DATA FLOW

1) A data flow has only one direction of flow between symbols. It may flow in both directions between a process and a data store to show a read before an update; the latter is usually indicated by two separate arrows, since the read and the update happen at different times.
2) A join in a DFD means that exactly the same data comes from any of two or more different processes, data stores or sinks to a common location.
3) A data flow cannot go directly back to the same process it leaves. There must be at least one other process that handles the data flow, produces some other data flow, and returns the original data flow to the beginning process.
4) A data flow to a data store means update (delete or change).
5) A data flow from a data store means retrieve or use.
6) A data flow has a noun-phrase label. More than one data flow noun phrase can appear on a single arrow, as long as all of the flows on the same arrow move together as one package.

DEVELOPING DATA-FLOW DIAGRAM


Top-Down Approach: The system designer first makes a context-level DFD (Level 0), which shows the interaction (data flows) between the system (represented by one process) and the system environment (represented by terminators). The system is then decomposed, in a lower-level DFD (Level 1), into a set of processes, data stores, and the data flows between these processes and data stores. Each process is in turn decomposed into an even lower-level diagram containing its sub-processes. The approach continues on the subsequent sub-processes until a necessary and sufficient level of detail is reached; a process that is not decomposed further is called a primitive process.

DATA FLOW DIAGRAM LEVELS


Context Level Diagram:
This level shows the overall context of the system and its operating environment, and shows the whole system as just one process. It does not usually show data stores, unless they are "owned" by external systems, e.g. accessed by but not maintained by this system; however, such stores are often shown as external entities.

Level 1 (High Level Diagram):


This level (Level 1) shows all processes at the first level of numbering, data stores, external entities and the data flows between them. The purpose of this level is to show the major high-level processes of the system and their interrelation. A process model will have one, and only one, Level 1 diagram. A Level 1 diagram must be balanced with its parent context-level diagram, i.e. there must be the same external entities and the same data flows; these can be broken down into more detail in Level 1.

LEVEL 2 DFD DIAGRAM:


1. The name and identifier of the higher-level process are shown at the top of the lower-level diagram.
2. The frame represents the boundary of the process; data flows across the frame must relate to data flows at the higher level.
3. A data store used by only one process is usually shown as internal to that process at the lower level.
4. Processes with no further decomposition are marked (e.g. with an asterisk).

Chapter 7

CODE SNIPPETS

1. RSSReader Class


package com.cvrce.projects.launcher;

import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RSSReader {

    private static RSSReader instance = null;

    private RSSReader() {
    }

    // Singleton accessor: the launcher always works with one shared reader.
    public static RSSReader getInstance() {
        if (instance == null) {
            instance = new RSSReader();
        }
        return instance;
    }

    // Fetches the RSS feed and builds a single spoken-news string.
    public String writeNews() {
        String s = "hello and welcome to News Reader Application. ";
        String newsInBrief = "Briefing the headlines?";
        String headLines = "! The headlines are?";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            URL u = new URL("http://feeds.bbci.co.uk/news/world/asia/rss.xml"); // your feed url
            Document doc = builder.parse(u.openStream());
            NodeList nodes = doc.getElementsByTagName("item");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                System.out.println(s);
                if (i == 0) {
                    newsInBrief = newsInBrief + " News? " + (i + 1) + "! ";
                } else {
                    newsInBrief = newsInBrief + "The next news is? News! " + (i + 1) + "! ";
                }
                newsInBrief = newsInBrief + " !\n" + getElementValue(element, "title")
                        + "? Now Describing the news! \n"
                        + getElementValue(element, "description") + " !and? ";
                headLines = headLines + getElementValue(element, "title") + "!";
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        s = s + newsInBrief;
        return s;
    }

    // Returns the text content of an element, or "" if there is none.
    private String getCharacterDataFromElement(Element e) {
        try {
            Node child = e.getFirstChild();
            if (child instanceof CharacterData) {
                CharacterData cd = (CharacterData) child;
                return cd.getData();
            }
        } catch (Exception ex) {
        }
        return "";
    }

    protected float getFloat(String value) {
        if (value != null && !value.equals("")) {
            return Float.parseFloat(value);
        }
        return 0;
    }

    protected String getElementValue(Element parent, String label) {
        return getCharacterDataFromElement(
                (Element) parent.getElementsByTagName(label).item(0));
    }
}

2. TaskLauncher1 Class

package com.cvrce.projects.launcher;

import java.awt.*;
import java.io.*;
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

public class TaskLauncher1 extends Frame {

    static int type;    // mediaType = 1 for movie, 2 for song and 3 for file
    Frame f;
    TextArea t1;

    public TaskLauncher1() {
        f = new Frame("BBC News");
        t1 = new TextArea(200, 200);
        f.add(t1);
        f.setSize(1200, 700);
    }

    public Boolean launchTask(String task) {
        System.out.println("Launcher received : " + task);
        try {
            // First pass: the category word sets the media type and the
            // system speaks the menu of numbered choices.
            if (task.contains("movie")) {
                type = 1;
                String s = "Select your movie! say? 1? for Sixth sense? 2? for Illusionist? "
                        + "3? for Madagascar? 4? for shrek? and 5? for Impact";
                voice1(s);
            } else if (task.contains("song")) {
                type = 2;
                String s = "Select your Music? say 1? for Chak de India? 2? for Give me some sun shine? "
                        + "3? for iss pal? 4? for miss independent and 5? for Kaash ik din ";
                voice1(s);
            } else if (task.contains("data file")) {
                type = 3;
                String s = "Select whose biodata file to read? say 1? for samarpita? "
                        + "2? for pranita? 3? for snigdha? and 4? for ellora green";
                voice1(s);
            }

            // Second pass: a spoken number selects an item of the current type.
            if (task.contains("one")) {
                if (type == 1) {    // play the first movie
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Movies\\Sixth_sense.avi");
                }
                if (type == 2) {    // play the first song
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\ChakDe.mp3");
                }
                if (type == 3)
                    fileread(1);
            }

            if (task.contains("two")) {
                if (type == 1) {    // play the second movie
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Movies\\The_Illusionist.avi");
                }
                if (type == 2) {    // play the second song
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\3idiots04.mp3");
                }
                if (type == 3)
                    fileread(2);
            }

            if (task.contains("three")) {
                if (type == 1) {    // play the third movie
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Movies\\madagascar2.mkv");
                }
                if (type == 2) {    // play the third song (space restored in the command line)
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\Ispal.mp3");
                }
                if (type == 3)
                    fileread(3);
            }

            if (task.contains("four")) {
                if (type == 1) {    // play the fourth movie
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Movies\\Shrek1.avi");
                }
                if (type == 2) {    // play the fourth song
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\MissIndependent.mp3");
                }
                if (type == 3)
                    fileread(4);
            }   // closing brace restored; in the original listing the "five"
                // branch was accidentally nested inside the "four" branch

            if (task.contains("five")) {
                if (type == 1) {    // play the fifth movie
                    Runtime.getRuntime().exec("D:\\VLC\\vlc D:\\Impact.avi");
                }
                if (type == 2) {    // play the fifth song
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\dwnlds\\showbiz03.mp3");
                }
            } else if (task.contains("news")) {
                readRSS();
            } else if (task.contains("snap")) {
                // space restored between the viewer path and the image path
                Runtime.getRuntime().exec("D:\\PicasaPhotoViewer D:\\friends.jpg");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return false;
    }

    public void listAllVoices() {
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice[] voices = voiceManager.getVoices();
    }

    // Speaks the given string with the FreeTTS kevin16 voice.
    public void voice1(String s) {
        listAllVoices();
        String voiceName = "kevin16";
        /* The VoiceManager manages all the voices for FreeTTS. */
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice helloVoice = voiceManager.getVoice(voiceName);
        if (helloVoice == null) {
            System.err.println("Cannot find a voice named " + voiceName
                    + ". Please specify a different voice.");
            System.exit(1);
        }
        helloVoice.allocate();      /* Allocates the resources for the voice. */
        helloVoice.speak(s);        /* Synthesize speech. */
        helloVoice.deallocate();
    }

    // Reads the selected biodata file aloud, line by line.
    public void fileread(int i) throws Exception {
        String s1 = "";
        if (i == 1) s1 = "D:/sambiodata.txt";
        if (i == 2) s1 = "D:/prabiodata.txt";
        if (i == 3) s1 = "E:/snicv.txt";
        if (i == 4) s1 = "E:/ellucv.txt";
        FileReader fr = new FileReader(s1);
        BufferedReader br = new BufferedReader(fr);
        String s2;
        while ((s2 = br.readLine()) != null) {
            System.out.println(s2);
            voice1(s2);
        }
        fr.close();
    }

    // Fetches the news, shows it in the frame, and speaks it.
    public void readRSS() {
        RSSReader reader = RSSReader.getInstance();
        String s = reader.writeNews();
        f.setVisible(true);
        t1.setText(s);
        voice1(s);
    }
}

3. Class VoiceResponseSystem
/*
 * Copyright 1999-2004 Carnegie Mellon University.
 * Portions Copyright 2004 Sun Microsystems, Inc.
 * Portions Copyright 2004 Mitsubishi Electric Research Laboratories.
 * All Rights Reserved. Use is subject to license terms.
 *
 * See the file "license.terms" for information on usage and
 * redistribution of this file, and for a DISCLAIMER OF ALL
 * WARRANTIES.
 */
package com.cvrce.projects.speech;

import com.cvrce.projects.launcher.TaskLauncher1;
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;

import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

/**
 * A simple speech application built using Sphinx-4. This application uses the
 * Sphinx-4 endpointer, which automatically segments incoming audio into
 * utterances and silences.
 */
public class VoiceResponseSystem {

	public void listAllVoices() {
		VoiceManager voiceManager = VoiceManager.getInstance();
		Voice[] voices = voiceManager.getVoices();
	}

	public void voice1(String s) {
		listAllVoices();
		String voiceName = "kevin16";
		/* The VoiceManager manages all the voices for FreeTTS. */
		VoiceManager voiceManager = VoiceManager.getInstance();
		Voice helloVoice = voiceManager.getVoice(voiceName);
		if (helloVoice == null) {
			System.err.println("Cannot find a voice named " + voiceName
					+ ". Please specify a different voice.");
			System.exit(1);
		}
		helloVoice.allocate();   // allocate the resources for the voice
		helloVoice.speak(s);     // synthesize speech
		helloVoice.deallocate();
	}

	public static void main(String[] args) {
		String s1 = "Hello and welcome to the Voice Response System! Select your option: "
				+ "say movie to watch a movie, song to listen to a song, news to hear the news, "
				+ "data file to hear the contents of a biodata file, and snap to view a picture.";
		VoiceResponseSystem v1 = new VoiceResponseSystem();
		//v1.voice1(s1);

		ConfigurationManager cm;
		if (args.length > 0) {
			cm = new ConfigurationManager(args[0]);
		} else {
			cm = new ConfigurationManager(VoiceResponseSystem.class.getResource("vrs.config.xml"));
		}

		Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
		recognizer.allocate();

		// start the microphone, or exit if this is not possible
		Microphone microphone = (Microphone) cm.lookup("microphone");
		if (!microphone.startRecording()) {
			System.out.println("Cannot start microphone.");
			recognizer.deallocate();
			System.exit(1);
		}

		System.out.println("Ask: Song/News/Data File/Movie/Snap");

		// loop the recognition until the program exits
		while (true) {
			System.out.println("Start speaking.\n");
			Result result = recognizer.recognize();
			if (result != null) {
				String resultText = result.getBestFinalResultNoFiller();
				System.out.println("You said: " + resultText + '\n');
				TaskLauncher1 tl = new TaskLauncher1();
				tl.launchTask(resultText);
			}
		}
	}
}

4. Grammar File
#JSGF V1.0;

/**
 * JSGF Grammar for Hello World example
 */

grammar hello;

public <greet> = ( Song | News | Data File | Movie | One | Two | Three | Four | Five | Snap );
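
For completeness: the class VoiceResponseSystem above loads its recognizer wiring from vrs.config.xml, which is not reproduced in this chapter. The following abridged sketch, modelled on the configuration that ships with the Sphinx-4 HelloWorld demo, shows the shape such a file might take. Only the component names recognizer and microphone are fixed by the code (they must match the cm.lookup() calls in main()); the remaining names, paths, and properties are illustrative assumptions, and a real configuration would also define the decoder, search manager, linguist, dictionary, acoustic model, and front-end chain.

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- Looked up by name in main(); delegates decoding to a decoder component. -->
    <component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
        <property name="decoder" value="decoder"/>
    </component>

    <!-- Loads the JSGF grammar (hello.gram) listed above. -->
    <component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
        <property name="grammarLocation" value="resource:/com/cvrce/projects/speech/"/>
        <property name="grammarName" value="hello"/>
    </component>

    <!-- Live audio input; looked up by name in main(). -->
    <component name="microphone" type="edu.cmu.sphinx.frontend.util.Microphone"/>
</config>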

Chapter 8 RESULTS & SCREENSHOTS


After running the application, it asks the user to choose an option by saying the keyword allocated to each action. There are five actions: 1. Song 2. Snap 3. Movie 4. News 5. Data File

Output of each action is described below.

1. Selecting Song
Example 1: After selecting song, the system asks for the options under this action, such as saying one for the song Chak De India, two for Give Me Some Sunshine, and so on.

Example 2:

2. Selecting Photo:
After selecting the option snap, it opens the picture friends.jpg, as shown below.

3. Selecting Movie:
After selecting movie, the system asks for the options under this action, such as saying one for the movie The Sixth Sense, two for The Illusionist, and so on.

Example 1: selected movie 4: Shrek2

Example 2:

4. Selecting News:

After selecting this option, the system connects to the BBC News RSS feed, i.e. http://feeds.bbci.co.uk/news/world/asia/rss.xml

5. Selecting a data file to read:

After selecting data file, the system asks for the options under this action, such as saying one for the file sambiodata.txt, two for prabiodata.txt, and so on.

Selected file 3: snicv.txt

Chapter 9 DISCUSSION
The modular framework of Sphinx-4 has permitted us to do some things very easily that have traditionally been difficult. The modular nature of Sphinx-4 also gives it the ability to use modules whose implementations range from general to specific applications of an algorithm. For example, we were able to improve the runtime speed of the RM1 regression test by almost two orders of magnitude merely by plugging in a new Linguist and leaving the rest of the system unchanged (a sketch of such a swap follows this passage). Furthermore, the modularity of Sphinx-4 allows it to support a wide variety of tasks: the various SearchManager implementations allow Sphinx-4 to efficiently support tasks ranging from small to large vocabularies, while the various Linguist implementations allow it to support traditional CFG-based command-and-control applications in addition to applications that use stochastic language models.

The modular nature of Sphinx-4 was enabled primarily by the use of the Java programming language. In particular, the ability of the Java platform to load code at run time permits simple support for the pluggable framework, and the interface construct of the Java programming language permits separation of the framework design from its implementation. The Java platform also provides Sphinx-4 with a number of other advantages:

Sphinx-4 can run on a variety of platforms without the need for recompilation.
The rich set of platform APIs greatly reduces coding time.
Built-in support for multithreading makes it simple to experiment with distributing decoding tasks across multiple threads.
Automatic garbage collection helps developers to concentrate on algorithm development instead of memory leaks.

On the downside, the Java platform can have issues with memory footprint. Also related to memory, some speech engines access platform memory directly in order to optimize memory throughput during decoding; such direct access to the platform memory model is not permitted by the Java programming language. A common misconception about the Java programming language is that it is too slow. When developing Sphinx-4, we carefully instrumented the code to measure various aspects of the system, comparing the results to its predecessor.
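
To make the Linguist swap mentioned above concrete: in Sphinx-4 such a swap is a configuration edit rather than a code change. The fragment below is a sketch only; the component name linguist is an assumption about the surrounding configuration, and the properties (omitted here) differ between the two classes:

<!-- A simple linguist, suited to small command-and-control grammars: -->
<component name="linguist" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
    <!-- properties omitted -->
</component>

<!-- Swapping in a lexicon-tree linguist for large-vocabulary tasks
     requires no Java changes, only this edit: -->
<component name="linguist" type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
    <!-- properties omitted -->
</component>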

Table I provides a summary showing that Sphinx-4 performs well (for both WER and RT, a lower number indicates better performance). One result helps to demonstrate the strength of the pluggable, modular design: we were able to plug in implementations of the Linguist and SearchManager optimized for particular tasks, allowing Sphinx-4 to perform much better on them. Another interesting aspect of the performance study is that raw computing speed is not the biggest concern for RT performance. For the two-CPU results, we used a Scorer that divided the scoring task equally across the available CPUs (see the sketch after Table I). While the increase in speed is noticeable, it is not as dramatic as we expected. Further analysis showed that only about 30 percent of the CPU time is spent on the actual scoring of the acoustic model states; the remaining 70 percent is spent on non-scoring activity, such as growing and pruning the ActiveList. Our results also show that the Java platform's garbage collection mechanism accounts for only 2-3 percent of the overall CPU usage.

TEST                   WER (%)   RT
TI46 (11 words)        0.168     0.02
TIDIGITS (11 words)    0.549     0.05
AN4 (79 words)         1.192     0.20
RM1 (1,000 words)      2.739     0.40
WSJ5K (5,000 words)    7.174     0.96

(Table I: Sphinx-4 performance. Word error rate (WER) is given in percent. Real-time (RT) speed is the ratio of the time taken to decode an utterance to the duration of the utterance, so values below 1 mean faster than real time; at RT 0.40, for example, a 10-second utterance decodes in about 4 seconds.)
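
The two-CPU Scorer referred to above can be pictured with a short sketch. This is not Sphinx-4's actual Scorer implementation, only an illustration under assumed names (Scoreable, scoreAll) of how a per-frame scoring pass over the active states might be divided evenly across the available processors:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;

// Illustrative sketch only: divides the scoring of active states across CPUs.
public class ParallelScorerSketch {

    interface Scoreable {
        void score(float[] feature); // score one state against one feature frame
    }

    static void scoreAll(List<Scoreable> activeList, float[] feature, ExecutorService pool)
            throws InterruptedException {
        int cpus = Runtime.getRuntime().availableProcessors();
        int chunk = (activeList.size() + cpus - 1) / cpus; // states per worker
        List<Callable<Void>> jobs = new ArrayList<>();
        for (int start = 0; start < activeList.size(); start += chunk) {
            List<Scoreable> slice =
                    activeList.subList(start, Math.min(start + chunk, activeList.size()));
            jobs.add(() -> {
                for (Scoreable s : slice) {
                    s.score(feature);
                }
                return null;
            });
        }
        pool.invokeAll(jobs); // blocks until every slice has been scored
    }
}

Here pool would typically be Executors.newFixedThreadPool(cpus). As the measurements above suggest, even a perfect split of the scoring work caps the overall speedup, since roughly 70 percent of the time goes to non-scoring work such as ActiveList maintenance.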

Results:
The test cases described above have been found to produce correct results, provided the voice is recognized correctly. However, voice recognition is not 100 percent accurate, and misrecognition can sometimes lead to frustrating results.

Known Bugs/Defects
Since the project is based on voice recognition, its working accuracy is not very high. Sometimes we may speak at our loudest, with the clearest pronunciation possible, and the program may still misunderstand what was spoken. This cannot be attributed to a bug in the project; it is a defect arising from a large number of factors, such as noise interference from the environment and differences between the user's accent and the accent the program is trained to understand.

Workaround:
While no perfect solution for this can be implemented, there is a workaround: train the program on the accent of a specific user, which in turn results in higher accuracy.

Chapter 10 CONCLUSION
ADVANTAGES:
Text can be entered through both keyboard and voice input.
Voice recognition of different notepad commands, such as open, save, and clear.
Different Windows software can be opened based on voice input.
Less time is needed to enter text.
Provides significant help for people with disabilities.
Lower operational costs.

DISADVANTAGES:
Low accuracy.
Performs poorly in noisy environments.

After careful development of the Sphinx-4 framework, we created a number of differing implementations for each module in the framework. For example, the FrontEnd implementations support MFCC, PLP, and LPC feature extraction; the Linguist implementations support a variety of language models, including CFGs, FSTs, and N-grams; and the Decoder supports a variety of SearchManager implementations. Using the ConfigurationManager, the various implementations of the modules can be combined in many ways, supporting our claim that we have developed a flexible, pluggable framework. Furthermore, the framework performs well in both speed and accuracy when compared to its predecessors. The Sphinx-4 framework is already proving itself to be research-ready, easily supporting various work as well as a specialized Linguist. We view this as only the very beginning, however, and expect Sphinx-4 to support future areas of core speech recognition research. Finally, the source code of Sphinx-4 is freely available. The license permits academic and commercial research and the development of products without any licensing fees. More information is available at http://cmusphinx.sourceforge.net/sphinx4.

This thesis/project work on a voice response system started with a brief introduction to the technology and its applications in different sectors. The project part of the report was based on software development for a voice response system. In the later stages we discussed different tools for bringing the idea into practice. After development, the software was tested, the results were discussed, and a few deficiencies were brought to light. After the testing work, the advantages of the software were described and suggestions for further enhancement and improvement were discussed.

Future Enhancements
This work can be taken further, and more can be done on the project to add modifications and additional features. The current software does not support a large vocabulary; further work will accumulate a larger number of samples and increase the efficiency of the software. The current version covers only a few areas, but more areas can be covered, and effort will be made in this regard.
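
As a pointer to how such an extension would begin, adding one more command requires only a new alternative in the JSGF grammar of Chapter 7 and a matching branch in the launcher. The Email command below is purely hypothetical and not part of the current software:

#JSGF V1.0;

grammar hello;

// Hypothetical extension: "Email" is not in the current project.
public <greet> = ( Song | News | Data File | Movie | One | Two | Three | Four | Five | Snap | Email );

launchTask() in TaskLauncher1 would then gain an else-if branch for the word email, mapping the new word to its action.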

Chapter 11 BIBLIOGRAPHY
[1] S. Young, "The HTK hidden Markov model toolkit: Design and philosophy," Cambridge University Engineering Department, UK, Tech. Rep. CUED/F-INFENG/TR152, Sept. 1994. [2] N. Deshmukh, A. Ganapathiraju, J. Hamaker, J. Picone, and M. Ordowski, "A public domain speech-to-text system," in Proceedings of the 6th European Conference on Speech Communication and Technology, vol. 5, Budapest, Hungary, Sept. 1999, pp. 2127–2130. [3] X. X. Li, Y. Zhao, X. Pi, L. H. Liang, and A. V. Nefian, "Audio-visual continuous speech recognition using a coupled hidden Markov model," in Proceedings of the 7th International Conference on Spoken Language Processing, Denver, CO, Sept. 2002, pp. 213–216. [4] K. F. Lee, H. W. Hon, and R. Reddy, "An overview of the SPHINX speech recognition system," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 1, pp. 35–45, Jan. 1990. [5] X. Huang, F. Alleva, H. W. Hon, M. Y. Hwang, and R. Rosenfeld, "The SPHINX-II speech recognition system: an overview," Computer Speech and Language, vol. 7, no. 2, pp. 137–148, 1993. [6] M. K. Ravishankar, "Efficient algorithms for speech recognition," Ph.D. thesis (CMU Technical Report CS-96-143), Carnegie Mellon University, Pittsburgh, PA, 1996. [7] P. Lamere, P. Kwok, W. Walker, E. Gouvea, R. Singh, B. Raj, and P. Wolf, "Design of the CMU Sphinx-4 decoder," in Proceedings of the 8th European Conference on Speech Communication and Technology, Geneva, Switzerland, Sept. 2003, pp. 1181–1184. [8] J. K. Baker, "The Dragon system - an overview," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 23, no. 1, Feb. 1975, pp. 24–29. [9] B. T. Lowerre, "The Harpy speech recognition system," Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, 1976. [10] J. K. Baker, "Stochastic modeling for automatic speech understanding," in Speech Recognition, R. Reddy, Ed. New York: Academic Press, 1975, pp. 521–542.

[11] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler, R. Stern, and E. Thayer, "The 1996 HUB-4 Sphinx-3 system," in Proceedings of the DARPA Speech Recognition Workshop. Chantilly, VA: DARPA, Feb. 1997. [Online]. Available: http://www.nist.gov/speech/publications/darpa97/pdf/placewa1.pdf [12] M. Ravishankar, "Some results on search complexity vs. accuracy," in Proceedings of the DARPA Speech Recognition Workshop. Chantilly, VA: DARPA, Feb. 1997. [Online]. Available: http://www.nist.gov/speech/publications/darpa97/pdf/ravisha1.pdf [13] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, 1998. [14] X. Huang, A. Acero, F. Alleva, M. Hwang, L. Jiang, and M. Mahajan, "From SPHINX-II to Whisper: Making speech recognition usable," in Automatic Speech and Speaker Recognition, Advanced Topics, C. Lee, F. Soong, and K. Paliwal, Eds. Norwell, MA: Kluwer Academic Publishers, 1996. [15] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, Aug. 1980. [16] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990. [17] NIST. Speech recognition scoring package (score). [Online]. Available: http://www.nist.gov/speech/tools [18] G. D. Forney, "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, 1973. [19] P. Kenny, R. Hollan, V. Gupta, M. Lenning, P. Mermelstein, and D. O'Shaughnessy, "A*-admissible heuristics for rapid lexical access," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pp. 49–59, Jan. 1993. [20] Java Speech API grammar format (JSGF). [Online]. Available: http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/ [21] M. Mohri, "Finite-state transducers in language and speech processing," Computational Linguistics, vol. 23, no. 2, pp. 269–311, 1997. [22] P. Clarkson and R. Rosenfeld, "Statistical language modeling using the CMU-Cambridge toolkit," in Proceedings of the 5th European Conference on Speech Communication and Technology, Rhodes, Greece, Sept. 1997. [23] Carnegie Mellon University. CMU pronouncing dictionary. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict

[24] S. J. Young, N. H. Russell, and J. H. S. Thornton, "Token passing: A simple conceptual model for connected speech recognition systems," Cambridge University Engineering Department, UK, Tech. Rep. CUED/F-INFENG/TR38, 1989. [25] R. Singh, M. Warmuth, B. Raj, and P. Lamere, "Classification with free energy at raised temperatures," in Proceedings of the 8th European Conference on Speech Communication and Technology, Geneva, Switzerland, Sept. 2003, pp. 1773–1776. [26] P. Kwok, "A technique for the integration of multiple parallel feature streams in the Sphinx-4 speech recognition system," Master's thesis (Sun Labs TR-2003-0341), Harvard University, Cambridge, MA, June 2003. [27] P. Price, W. M. Fisher, J. Bernstein, and D. S. Pallett, "The DARPA 1000-word resource management database for continuous speech recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 1. IEEE, 1988, pp. 651–654. [28] G. R. Doddington and T. B. Schalk, "Speech recognition: Turning theory to practice," IEEE Spectrum, vol. 18, no. 9, pp. 26–32, Sept. 1981. [29] R. G. Leonard and G. R. Doddington, "A database for speaker-independent digit recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 3. IEEE, 1984, p. 42.11. [30] J. Garofolo, E. Voorhees, C. Auzanne, V. Stanford, and B. Lund, "Design and preparation of the 1996 HUB-4 broadcast news benchmark test corpora," in Proceedings of the DARPA Speech Recognition Workshop. Chantilly, VA: Morgan Kaufmann, Feb. 1997, pp. 15–21. [31] (2003, Mar.) Sphinx-4 trainer design. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/TrainerDesign [32] J. R. Glass, "A probabilistic framework for segment-based speech recognition," Computer Speech and Language, vol. 17, no. 2, pp. 137–152, Apr. 2003.
