A voice response system is a computer system that responds to voice commands rather than to input from a keyboard or a mouse. Uses for this kind of system range from convenience to necessity to security. People who are visually or otherwise physically impaired are prime candidates for such a system: because they cannot see or otherwise access a keyboard or mouse, they have no way to use a computer without one, unless they want to depend entirely on other people. A voice response system is also convenient for users who are not physically impaired; as long as you are within earshot of the PC, it can accept voice commands from you in the same way that it traditionally accepts keystroke and mouse commands. The system acquires speech at run time through a microphone and processes the sampled speech to recognize the uttered text; the recognized text is then matched to a corresponding action. Recognition is performed with Sphinx-4, a speech recognition system written entirely in the Java(TM) programming language. The VRS is an intelligent system which enables the user to instruct the computer to perform actions through voice commands and to form his own repository of commands mapped to appropriate actions.
CONTENTS
1. Introduction
2. Voice Recognition
   Relevance of the Project
   Applications of Voice Recognition
3. Working of the Project
   Speech Engine
   JSAPI
   JSAPI Classes and Interfaces
   Speech Synthesis
   Speech Recognition
   Components
   Speech Recognition Weakness & Flaws
   Future of Speech Recognition
   JSGF Grammar Format
   Sphinx Speech Recognition System
4. Feasibility Study & Requirement Analysis
5. System Analysis & System Design
6. Data Flow Diagram
   Context Diagram
   Level 1
   Level 2
Chapter 1 INTRODUCTION
A VRS is an intelligent system which enables the user to instruct the computer to perform actions through voice commands and to form his own repository of commands, mapping them to appropriate actions. A voice response system is a computer system that responds to voice commands rather than to input from a keystroke or a mouse. Uses for this kind of system range from convenience to necessity to security. People who are visually or otherwise physically impaired are prime candidates for a voice response system. Because they cannot see or otherwise access a keyboard or mouse, they have no way to access a computer without a voice response system, unless they want to depend entirely on other people. Being able literally to tell a computer what to do may be a revelation for someone who ordinarily has little hope of controlling a computer. A voice response system would also come in handy for someone who is not physically impaired. With a voice response system, you would not need to be very close to your computer in order to access it or give it commands. As long as you are within earshot of the PC, it can use its voice response system to accept voice commands from you in the same way that it traditionally accepts keystroke and mouse commands.
Key points that outline the implemented idea are: the VRS runs as a background process, and based on each instruction, independent processes are created. While the background process keeps listening for user requirements, independent processes are continuously created in response to the input voice instructions. Voice recognition could also be enabled in the processes launched on top, but this has been avoided because it interferes with the background process.
A VRS library has been built which includes some basic commands:
1. DATA FILE - Opens a list of saved files that may be opened.
2. SONGS - Opens a list of songs that may be played.
3. MOVIES - Opens a list of movies that may be played.
4. NEWS - Reads the news from a given website.
5. SNAP - Opens a picture.
The library may be further extended by the user for his own specific requirements. User.gram has been included in the src along with directions to add an action map for this purpose. Technologies used in the implementation: Sphinx-4, JSAPI, the Java programming language, and JSGF grammar files.
The relevance and use of each of the above is discussed later in the document. The code has been developed in Eclipse. The paths used in mapping actions are absolute and hence system dependent.
The requirement of this project is to develop an intelligent system which:
1. is capable of taking voice input;
2. interprets the input command;
3. processes the command to map it to the action set;
4. has an action set containing mappings from inputs to their corresponding responses;
5. has an adaptive mechanism to handle more mappings and add them to the action set.
Example: the voice input "draw circle" draws a circle on the screen.
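As a minimal sketch of such an action set (the class name, default commands, and action strings below are illustrative placeholders, not the project's actual code), the command-to-action mapping and its adaptive extension mechanism could look like this:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a VRS action repository: recognized commands
// are keyed to actions. The action strings here are placeholders for
// the real (absolute, system-dependent) paths the report mentions.
public class CommandRepository {
    private final Map<String, String> actions = new HashMap<>();

    public CommandRepository() {
        // A few default library entries, mirroring the commands listed above.
        actions.put("DATA FILE", "open-file-list");
        actions.put("SONGS", "open-song-list");
        actions.put("NEWS", "read-news");
    }

    // The adaptive mechanism: users may register new mappings at run time.
    public void register(String command, String action) {
        actions.put(command.trim().toUpperCase(), action);
    }

    // Map a recognized utterance to its action, or null if unknown.
    public String lookup(String recognizedText) {
        return actions.get(recognizedText.trim().toUpperCase());
    }
}
```

A background listener would call `lookup` on each recognized utterance and spawn an independent process for the returned action.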
Chapter 2
VOICE RECOGNITION
The term voice recognition is sometimes used to refer to recognition systems that must be trained to a particular speaker, as is the case for most desktop recognition software.
1. Voice recognition converts speech to text.
2. Recognizing the speaker can simplify the task of translating speech.
3. Voice recognition aims to generalize the task rather than being targeted at a single speaker.
Although the idea of recognizing voice may seem fairly simple, there are many real-time problems. Some include:
1. A large amount of memory is required to store voice files.
2. Noise interference reduces accuracy.
3. Differences between the user's accent and the trained voice often give rise to absurd results.
4. The precision of the system is directly proportional to the complexity of the source code.
Speech recognition is also used to enable deaf people to understand the spoken word via speech-to-text conversion, which is very helpful.
Military
Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16VISTA), the program in France on installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays. Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system.
Home Automation
Although luxury is the priority here, such a program also finds application in home automation. Home automation may include centralized control of lighting, heating, ventilation, air conditioning, and other systems, to provide improved convenience, comfort, energy efficiency, and security.
Transcription
Transcription in the linguistic sense is the conversion of a representation of language into another representation of language, usually in the same language but in a different form. Transcription should not be confused with translation, which in linguistics usually means converting from one language to another, such as from English to Spanish. The most common type of transcription is from a spoken-language source into text.
Speech Engine
The speech engine loads a list of words to be recognized; this list of words is called a grammar. The engine takes as input distinct characteristics of sound derived from the waveform and compares them with its own acoustic model. It searches its acoustic space, using the grammar to guide this search, then determines which words in the grammar the audio most closely matches and returns a result.
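To illustrate grammar-guided matching in the simplest possible terms, the toy sketch below compares a (possibly misrecognized) decoded string against the grammar's word list using edit distance and returns the closest entry. This stands in for real acoustic scoring, which works on sound features rather than text; all names here are hypothetical:

```java
import java.util.List;

// Toy illustration of grammar-constrained matching: the grammar limits
// the candidate words, and the engine picks the closest match.
public class GrammarMatcher {
    // Classic Levenshtein edit distance between two strings.
    public static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Return the grammar word closest to what was heard.
    public static String bestMatch(String heard, List<String> grammar) {
        String best = null;
        int bestScore = Integer.MAX_VALUE;
        for (String w : grammar) {
            int s = editDistance(heard, w);
            if (s < bestScore) { bestScore = s; best = w; }
        }
        return best;
    }
}
```

Because the search is restricted to the grammar's words, even a noisy hypothesis like "snogs" resolves to the in-grammar command "songs".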
JSAPI Classes and Interfaces
The different classes and interfaces that form the Java Speech API are grouped into the following three packages:
javax.speech: Contains classes and interfaces for a generic speech engine. javax.speech.synthesis: Contains classes and interfaces for speech synthesis. javax.speech.recognition: Contains classes and interfaces for speech recognition.
The Central class is like a factory class that all Java Speech API applications use. It provides static methods to enable the access of speech synthesis and speech recognition engines. The Engine interface encapsulates the generic operations that a Java Speech API-compliant speech engine should provide for speech applications.
Speech applications can primarily use methods to perform actions such as retrieving the properties and state of the speech engine and allocating and deallocating resources for a speech engine. In addition, the Engine interface exposes mechanisms to pause and resume the audio
stream generated or processed by the speech engine. The Engine interface is subclassed by the Synthesizer and Recognizer interfaces, which define additional speech synthesis and speech recognition functionality. The Synthesizer interface encapsulates the operations that a Java Speech API-compliant speech synthesis engine should provide for speech applications.
The Java Speech API is based on the event-handling model of AWT components. Events generated by the speech engine can be identified and handled as required. There are two ways to handle speech engine events: through the EngineListener interface or through the EngineAdapter class.
JSAPI STACK
Features:
1. Converts speech to text.
2. Converts text to speech and delivers it in various formats.
3. Supports events based on the Java event queue.
4. An easy-to-implement API that interoperates with multiple Java-based applications, such as applets and Swing applications.
5. Interacts seamlessly with the AWT event queue.
6. Supports annotations using JSML to improve pronunciation and naturalness in speech.
7. Supports grammar definitions using JSGF.
8. Can adapt to the language of the speaker.
Two core speech technologies are supported through the Java Speech API: speech synthesis and speech recognition.
Speech synthesis
Speech synthesis is the reverse process: it produces synthetic speech from text generated by an application, an applet, or a user. It is often referred to as text-to-speech technology. The major steps in producing speech from text are as follows:
Structure analysis: Processes the input text to determine where paragraphs, sentences, and other structures start and end. For most languages, punctuation and formatting data are used in this stage.
Text pre-processing: Analyzes the input text for special constructs of the language. In English, special treatment is required for abbreviations, acronyms, dates, times, numbers, currency amounts, e-mail addresses, and many other forms. Other languages need special processing for these forms, and most languages have other specialized requirements.
Text-to-phoneme conversion: Converts each word to phonemes. A phoneme is a basic unit of sound in a language.
Prosody analysis: Processes the sentence structure, words, and phonemes to determine the appropriate prosody for the sentence.
Waveform production: Uses the phonemes and prosody information to produce the audio waveform for each sentence. Speech synthesizers can make errors in any of the processing steps described above.
Human ears are well-tuned to detecting these errors, but careful work by developers can minimize errors and improve the speech output quality. The Java Speech API and the Java Speech API Markup Language (JSML) provide many ways for you to improve the output quality of a speech synthesizer.
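The first two stages above, structure analysis and text pre-processing, can be sketched in a few lines of Java. The abbreviation table and the sentence-splitting rule below are deliberately naive placeholders, not JSML and not what a real synthesizer does:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of structure analysis (sentence splitting) and text
// pre-processing (abbreviation expansion). Expansion must run before
// splitting, or the period in "Dr." would end a sentence prematurely.
public class TextPreprocessor {
    private static final Map<String, String> ABBREVIATIONS = new LinkedHashMap<>();
    static {
        // Illustrative entries only; a real system needs a full table.
        ABBREVIATIONS.put("Dr.", "Doctor");
        ABBREVIATIONS.put("St.", "Street");
    }

    // Text pre-processing: expand known abbreviations.
    public static String expand(String text) {
        for (Map.Entry<String, String> e : ABBREVIATIONS.entrySet())
            text = text.replace(e.getKey(), e.getValue());
        return text;
    }

    // Structure analysis: sentences end at '.', '!' or '?'.
    public static String[] sentences(String text) {
        return text.split("(?<=[.!?])\\s+");
    }
}
```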
Speech Recognition
Speech recognition provides computers with the ability to listen to spoken language and determine what has been said. In other words, it processes audio input containing speech by converting it to text.
Digitization
The process of converting the analog signal into digital form is known as digitization; it involves both sampling and quantization. Sampling converts a continuous signal into a discrete signal, while quantization approximates a continuous range of values by a finite set of discrete levels.
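Sampling and quantization can be illustrated with a short sketch: a hypothetical 1 kHz tone sampled at 8 kHz and quantized to 8 bits, as a digitizer front end would do (real audio capture would of course read from a microphone, not a formula):

```java
// Sketch of digitization: sample a sine wave at discrete instants
// (sampling) and map each sample onto 256 signed-byte levels (quantization).
public class Digitizer {
    public static byte[] digitize(double freqHz, double sampleRateHz, int numSamples) {
        byte[] out = new byte[numSamples];
        for (int n = 0; n < numSamples; n++) {
            // Sampling: evaluate the continuous signal at discrete times n/Fs.
            double s = Math.sin(2 * Math.PI * freqHz * n / sampleRateHz);
            // Quantization: map the range [-1, 1] onto [-127, 127].
            out[n] = (byte) Math.round(s * 127);
        }
        return out;
    }
}
```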
Acoustic Model
An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech. The acoustic model software breaks the words into phonemes.
Language Model
Language modeling is used in many natural language processing applications; in speech recognition, it tries to capture the properties of a language and to predict the next word in the speech sequence. The software language model compares the phonemes to words in its built-in dictionary.
Speech engine
The job of the speech recognition engine is to convert the input audio into text; to accomplish this it uses all sorts of data, software algorithms, and statistics. Its first operation is digitization, as discussed earlier: converting the audio into a format suitable for further processing. Once the audio signal is in the proper format, the engine searches for the best match by considering the words it knows; once the signal is recognized, it returns the corresponding text string.
Grammar design: Defines the words that may be spoken by a user and the patterns in which they may be spoken.
Signal processing: Analyzes the spectrum (the frequency) characteristics of the incoming audio.
Phoneme recognition: Compares the spectrum patterns to the patterns of the phonemes of the language being recognized.
Word recognition: Compares the sequence of likely phonemes against the words and patterns of words specified by the active grammars.
Result generation: Provides the application with information about the words the recognizer has detected in the incoming audio. The result information is always provided once recognition of a single utterance (often a sentence) is complete, but may also be provided during the recognition process. The result always indicates the recognizer's best guess of what a user said, but may also indicate alternative guesses.
A grammar is an object in the Java Speech API that indicates what words a user is expected to say and in what patterns those words may occur. Grammars are important to speech recognizers because they constrain the recognition process. These constraints make recognition faster and more accurate because the recognizer does not have to check for bizarre sentences. The Java Speech API supports two basic grammar types: rule grammars and dictation grammars. These types differ in various ways, including how applications set up the grammars; the types of sentences they allow; how results are provided; the amount of computational resources required; and how they are used in application design. Rule grammars are defined by JSGF, the Java Speech Grammar Format.
Speech Recognition Weakness & Flaws
A human mind is a God-gifted thing, and the capability of thinking, understanding, and reacting is natural, while for a computer program it is a complicated task: it first needs to understand the spoken words with respect to their meanings, and it has to strike a sufficient balance between the words, the noise, and the pauses. A human has a built-in capability of filtering the noise from speech, while a machine requires training; a computer requires help in separating the speech sounds from other sounds.
A second challenge in the process is to understand speech uttered by different users; current systems have difficulty separating simultaneous speech from multiple users. Noise factor: the program needs to hear the words uttered by a human distinctly and clearly. Any extra sound can create interference, so the system must first be placed away from noisy environments, and the user must then speak clearly, or else the machine will become confused and mix up the words.
Future of Speech Recognition
Microphone and sound systems will be designed to adapt more quickly to changing background noise levels and different environments, with better recognition of extraneous material to be discarded.
The Java Speech Grammar Format (JSGF) defines a platform-independent, vendor-independent way of describing one type of grammar, a rule grammar (also known as a command and control grammar or regular grammar). It uses a textual representation that is readable and editable by both developers and computers, and can be included in Java source code. The other major grammar type, the dictation grammar, is not discussed in this document.
A rule grammar specifies the types of utterances a user might say (a spoken utterance is similar to a written sentence). For example, a simple window control grammar might listen for "open a file", "close the window", and similar commands.
What the user can say depends upon the context: is the user controlling an email application, reading a credit card number, or selecting a font? Applications know the context, so applications are responsible for providing a speech recognizer with appropriate grammars.
This document is the specification for the Java Speech Grammar Format. First, the basic naming and structural mechanisms are described. Following that, the basic components of the grammar, the grammar header and the grammar body, are described. The grammar header declares the grammar name and lists the imported rules and grammars. The grammar body defines the rules of this grammar as combinations of speakable text and references to other rules. Finally, some
simple examples of grammar declarations are provided. Grammars are used by speech recognizers to determine what the recognizer should listen for, and so describe the utterances a user may say. A Java Speech Grammar Format document starts with a self-identifying header. This header identifies that the document contains JSGF and indicates the version of JSGF being used (currently V1.0):

#JSGF V1.0;

The grammar body defines rules. Each rule is defined in a rule definition; a rule is defined only once in a grammar, and the order of definition of rules is not significant. A rule definition takes one of two forms:

<ruleName> = ruleExpansion;
public <ruleName> = ruleExpansion;
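Putting the header and rule definitions together, a complete rule grammar for the command vocabulary of a system like this one might look as follows. The grammar name and word lists here are illustrative placeholders, not the project's actual User.gram:

```jsgf
#JSGF V1.0;

grammar vrs.commands;

// Each public rule can be spoken directly by the user.
public <command> = <action> <object>;

// Private rules are only referenced from other rules.
<action> = open | play | read;
<object> = songs | movies | news;
```

A recognizer loading this grammar would listen for utterances such as "play songs" or "read news" and reject anything outside the defined patterns.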
Recognizer- Contains the main components of Sphinx-4, which are the front end, the linguist, and the decoder. The application interacts with the Sphinx-4 system mainly via the Recognizer.
Audio - The data to be decoded. This is audio in most systems, but it can also be configured to accept other forms of data, e.g., spectral or cepstral data.
Front End- Performs digital signal processing (DSP) on the incoming data.
Feature- The output of the front end are features, which are used for decoding in the rest of the system.
Linguist- Embodies the linguistic knowledge of the system: the acoustic model, the dictionary, and the language model. The linguist produces a search graph structure on which the search manager performs its search using different algorithms.
Sphinx-4 Architecture
Acoustic Model- Contains a representation (often statistical) of a sound, often created by training using lots of acoustic data
Language Model- Contains a representation (often statistical) of the probability of occurrence of words.
Search Graph- The graph structure produced by the linguist according to certain criteria (e.g., the grammar), using knowledge from the dictionary, the acoustic model, and the language model.
Search Manager- Performs the search using a particular algorithm, e.g., breadth-first search, best-first search, depth-first search, etc. Also contains the feature scorer and the pruner.
Active List- A list of tokens representing all the states in the search graph that are active in the current feature frame.
Scorer- Scores the current feature frame against all the active states in the Active List.
Result- The decoded result, which usually contains the N-best results.
Configuration Manager- Loads the Sphinx-4 configuration data from an XML-based file, and manages the component life cycle for objects.
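As an illustration of the kind of search the search manager performs over the linguist's search graph, here is a hand-rolled best-first (lowest-cost-first) search over a toy word graph. This is not Sphinx-4 code; the edge weights stand in for combined acoustic and language-model costs, and all names are made up:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Best-first search sketch: the cheapest partial path is expanded first,
// with an active list of states ordered by accumulated cost.
public class SearchManagerSketch {
    public static List<String> bestPath(Map<String, Map<String, Double>> graph,
                                        String start, String goal) {
        Map<String, Double> cost = new HashMap<>();
        Map<String, String> prev = new HashMap<>();
        // Active list: {accumulated cost, state}, cheapest first.
        PriorityQueue<Object[]> active =
            new PriorityQueue<>((a, b) -> Double.compare((double) a[0], (double) b[0]));
        cost.put(start, 0.0);
        active.add(new Object[]{0.0, start});
        while (!active.isEmpty()) {
            Object[] top = active.poll();
            String state = (String) top[1];
            if ((double) top[0] > cost.get(state)) continue; // stale entry
            if (state.equals(goal)) break;
            for (Map.Entry<String, Double> e :
                     graph.getOrDefault(state, Collections.<String, Double>emptyMap()).entrySet()) {
                double c = cost.get(state) + e.getValue();
                if (c < cost.getOrDefault(e.getKey(), Double.POSITIVE_INFINITY)) {
                    cost.put(e.getKey(), c);
                    prev.put(e.getKey(), state);
                    active.add(new Object[]{c, e.getKey()});
                }
            }
        }
        // Reconstruct the best word sequence by walking predecessors back.
        List<String> path = new ArrayList<>();
        for (String s = goal; s != null; s = prev.get(s)) path.add(s);
        Collections.reverse(path);
        return path;
    }
}
```

With costs on the word hypotheses "one" and "won", the search settles on the cheaper overall path, which is exactly how the decoder disambiguates acoustically similar words.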
Recognition Issue:
Goal: audio goes in, results come out.
Front-End:
Transforms the speech waveform into features used by recognition. The features are sets of mel-frequency cepstrum coefficients (MFCC), which model the human auditory system. The Front-End is a set of signal-processing filters with a pluggable architecture.
Knowledge Base:
The data that drives the decoder. It consists of three sets of data: the dictionary, the acoustic model, and the language model.
DICTIONARY:
Maps words to pronunciations and provides word classification information (such as part of speech). A single word may have multiple pronunciations, and pronunciations are represented as phones or other units. The dictionary can vary in size from a dozen words to more than 100,000 words.
Language Model:
Describes what is likely to be spoken in a particular context. It uses a stochastic approach: word transitions are defined in terms of transition probabilities, which helps to constrain the search space.
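The stochastic word-transition idea can be sketched as a maximum-likelihood bigram model, the simplest language model of this kind. The class name and the tiny training corpus below are illustrative, and no smoothing is applied:

```java
import java.util.HashMap;
import java.util.Map;

// Bigram language model sketch: counts from a corpus give transition
// probabilities P(next | previous), which a decoder uses to prefer
// likely word sequences and prune unlikely ones.
public class BigramModel {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();

    // Count each adjacent word pair in the training sequence.
    public void train(String[] words) {
        for (int i = 0; i + 1 < words.length; i++) {
            counts.computeIfAbsent(words[i], k -> new HashMap<>())
                  .merge(words[i + 1], 1, Integer::sum);
            totals.merge(words[i], 1, Integer::sum);
        }
    }

    // P(next | prev) by maximum likelihood; 0.0 for unseen transitions.
    public double probability(String prev, String next) {
        Map<String, Integer> row = counts.get(prev);
        if (row == null || !row.containsKey(next)) return 0.0;
        return row.get(next) / (double) totals.get(prev);
    }
}
```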
Acoustic Models:
A database of statistical models, each of which represents a single unit of speech such as a word or phoneme. Acoustic models are created and trained by analyzing large corpora of labeled speech, and they can be speaker dependent or speaker independent.
FEASIBILITY STUDY
The project is feasible because speech recognition is already frequently used in areas such as the military, telephony, and healthcare. It is also used by top industries to recognize their employees in attendance systems. So it is feasible and can be completed in the given period. A real-time voice recognition security system can be developed using different algorithms.
Economical Feasibility:
This feasibility deals with the cost/benefit analysis. A number of intangible benefits like user friendliness, robustness and security were pointed out. The cost that will be incurred upon the implementation of this project would be quite nominal.
Operational Feasibility:
The developed system will be very reliable and user friendly. All the features and operations that we will implement in our project are possible to implement and thus feasible. This will facilitate easy use and adoption of the system. With the use of menus and proper validation, it becomes fully understandable and operable by the common user.
Covering letter:
It formally presents the report, with a brief description of the project problem along with recommendations to be considered.
Table of content:
It lists the sections of the feasibility study report along with their page numbers.
System requirement:
The system requirements, which are either derived from the existing system or from the discussion with the users, are presented in this section.
Development plan:
It presents a detailed plan with starting and completion dates for the different phases of the SDLC. Complementary plans are also needed for hardware and software evaluation, purchase, and installation.
Costs and benefits: The detailed findings of the cost and benefit analysis are presented in this section. The savings and benefits are highlighted to justify the economic feasibility of the project.
REQUIREMENT ANALYSIS
A requirement is a condition or capability that must be met or possessed by a system to satisfy a contract, standard, specification or other formally imposed specification of the client. This phase ends with the Software Requirements Specifications (SRS). The SRS is a document that completely describes what the proposed software should do without describing how the software will do it.
Analysis Methodology:
A complete understanding of the requirements is essential for the success of a project. This is achieved by gathering information; the approach and manner of gathering require sensitivity, common sense, and knowledge of what to gather, when to gather it, and what to use in securing information. There are various tools for gathering information during the system analysis phase. The phases are:
1. Familiarity with the present system through available documentation, such as procedure manuals, documents and their flow, interviews of user staff, and on-site observation.
2. Definition of the decision making associated with managing the system. This is important for determining what information is required of the system; conducting interviews clarifies the decision points and how decisions are made in the user area.
3. Once the decision points are identified, a study may be conducted to define the information requirements. The information gathered is analyzed and documented, and discrepancies between the decision system and the information gathered from the information system are identified. This concludes the analysis and sets the stage for system design.
Organization-based information deals with policies, objectives, goals, and structure. User-based information focuses on information requirements. Work-based information addresses the work flow, methods and procedures, and workstations. We are interested in what happens to data as it passes through various points in the system.
SYSTEM REQUIREMENTS:
SOFTWARE REQUIREMENTS:
Language: Java SDK, Eclipse
Front-end tool: Sphinx-4
Back-end tool: Oracle 10g for the database
Operating system: Windows XP/7
Microsoft Word is used for documentation.
HARDWARE REQUIREMENTS:
Processor: PC with a Pentium IV-class processor, 600 MHz (recommended: Pentium IV-class, 1.63 GHz)
RAM: 1 GB
Hard disk space: 20 GB on the system drive, 10 GB for the development environment
Microphone: good-quality microphone
SYSTEM ANALYSIS
System analysis is a term used to describe the process of collecting and analyzing facts about the existing operation of the prevailing situation, so that an effective and accurate computerized system may be designed and implemented if found feasible. This is required in order to understand the problem that has to be solved. The problem may be of any kind: computerizing an existing system, developing an entirely new system, or a combination of the two. The aim of the design phase is not to solve the problem outright, but to determine how the problem can be solved. For this, a logical model of the system is required, providing the way to solve the problem and achieve the desired goal. The logical view of the system is provided to the developer and user for decision making, so that the developer can feel at ease in designing the system.
SPECIFICATION OF PROJECT
The proposed system should have the following features:
1. It should be able to store voices in .wav format.
2. It should be able to store usernames in the database.
3. It should provide options for existing and new users.
4. It should have the ability to process voice prints.
5. It should closely match the voices.
6. It should recognize speech to a reasonable extent.
7. It should provide proper guidance to the user.
8. It should give fast results.
SYSTEM DESIGN
System design is the technique of creating a system that takes into account such factors as needs, performance levels, database design, hardware specifications, and data management. It is the most important part of the development of the system, as in the design phase the developer brings into existence the proposed system that the analyst thought of in the analysis phase.
DESIGN CONCEPT
Software design sits at the technical kernel of software engineering and is applied regardless of the software process model that is used. After software requirements have been analyzed and specified, software design is the first of three technical activities (design, code generation, and test) that are required to build and verify the software. Each activity transforms information in a manner that ultimately results in validated computer software. The design transforms the information domain model created during analysis into the data structures that will be required to implement the software. The data objects and relationships diagram and the detailed data content depicted in the data dictionary provide the basis for the design activity. As aforesaid, design is the phase of software engineering that determines the success or complete failure of a project. In our project, the Voice Response System, we have spent maximum time on preprocessing and processing of the input. Data flow diagrams for the project have also been developed. The training database structures are well defined, with a complete description of the data used. Another part which took most of our consideration is the design of the user input for directly giving the path of a resource in the dialog box and then executing it. The architectural design defines the relationships between the major structural elements of the software, the design patterns that can be used to achieve the requirements that have been defined for the system, and the constraints that affect the way in which architectural design patterns can be applied. The interface design describes how the software communicates within itself, with systems that interoperate with it, and with the humans who use it. An interface implies a flow of information and a specific type of behavior. Design is the phase where quality is fostered: design provides us with representations of software that can be assessed for quality, and it is the only way that we can accurately translate a customer's requirements into a finished software product or system. Design serves as the foundation of the software support steps that follow.
DATA FLOW DIAGRAMS
In a data flow diagram, the system is described logically and independently of the physical components associated with it; such diagrams are known as logical data flow diagrams. Physical data flow diagrams show the actual implementation and movement of data between people, departments, and workstations. A full description of a system actually consists of a set of data flow diagrams, developed using the two familiar notations of Yourdon and of Gane and Sarson. Each component in a DFD is labelled with a descriptive name, and each process is further identified with a number used for identification purposes. DFDs are developed in several levels: each process in a lower-level diagram can be broken down into a more detailed DFD at the next level. The top-level diagram is often called the context diagram. It consists of a single process bubble, which plays a vital role in studying the current system. The process in the context-level diagram is exploded into other processes at the first-level DFD. The idea behind exploding a process into more processes is that understanding at one level of detail is expanded into greater detail at the next level. This is done until no further explosion is necessary and an adequate amount of detail is described for the analyst to understand the process. Larry Constantine first developed the DFD as a way of expressing system requirements in graphical form; this led to modular design. A DFD, also known as a bubble chart, has the purpose of clarifying system requirements and identifying the major transformations that will become programs in system design. So it is the starting point of design, down to the lowest level of detail. A DFD consists of a series of bubbles joined by data flows in the system.
DFD SYMBOLS:
In the DFD there are four symbols:
1. A square defines a source (originator) or destination of system data.
2. An arrow identifies a data flow: the pipeline through which information flows.
3. A circle or bubble represents a process that transforms incoming data flows into outgoing data flows.
4. An open rectangle is a data store: data at rest, or a temporary repository of data.
CONSTRUCTING DFD:
Several rules of thumb are used in drawing DFDs:
1. Processes should be named and numbered for easy reference. Each name should be representative of the process.
2. The direction of flow is from top to bottom and from left to right. Data traditionally flow from the source to the destination, although they may flow back to the source. One way to indicate this is to draw a long flow line back to the source; an alternative way is to repeat the source symbol as a destination. Since it is used more than once in the DFD, it is marked with a short diagonal.
3. When a process is exploded into lower-level details, the sub-processes are numbered.
4. The names of data stores, sources, and destinations are written in capital letters. Process and data flow names have the first letter of each word capitalized.
5. A DFD typically shows the minimum contents of a data store. Each data store should contain all the data elements that flow in and out. Missing interfaces, redundancies, and the like are then accounted for, often through interviews.
CURRENT PHYSICAL:
In the current physical DFD, process labels include the names of people or their positions, or the names of the computer systems, that might provide some of the overall system processing; each label also identifies the technology used to process the data. Similarly, data flows and data stores are often labelled with the names of the actual physical media on which data are stored, such as file folders, computer files, business forms or computer tapes.
CURRENT LOGICAL:
The physical aspects of the system are removed as much as possible, so that the current system is reduced to its essence: the data and the processes that transform them, regardless of their actual physical form.
NEW LOGICAL:
This is exactly like the current logical model if the user were completely happy with the functionality of the current system but had problems with how it was implemented. Typically, the new logical model will differ from the current logical model in having additional functions, obsolete functions removed, and inefficient flows reorganized.
NEW PHYSICAL:
The new physical represents only the physical implementation of the new system.
DATA FLOW
1) A data flow has only one direction of flow between symbols. It may flow in both directions between a process and a data store to show a read before an update; the latter is usually indicated, however, by two separate arrows, since these happen at different times.
2) A join in a DFD means that exactly the same data comes from any of two or more different processes, data stores or sinks to a common location.
3) A data flow cannot go directly back to the same process it leaves. There must be at least one other process that handles the data flow, produces some other data flow, and returns the original data flow to the beginning process.
4) A data flow to a data store means update (delete or change).
5) A data flow from a data store means retrieve or use.
6) A data flow has a noun-phrase label. More than one data-flow noun phrase can appear on a single arrow, as long as all of the flows on the same arrow move together as one package.
Each process is then decomposed into an even-lower-level diagram containing its sub-processes. This approach then continues on the subsequent sub-processes until a necessary and sufficient level of detail is reached, which is called the primitive process level.
Chapter 7
1.Class RSSReader

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        URL u = new URL("http://feeds.bbci.co.uk/news/world/asia/rss.xml"); // your feed url
        Document doc = builder.parse(u.openStream());
        NodeList nodes = doc.getElementsByTagName("item");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            /*
            System.out.println("Title: " + getElementValue(element, "title"));
            System.out.println("Link: " + getElementValue(element, "link"));
            System.out.println("Publish Date: " + getElementValue(element, "pubDate"));
            System.out.println("author: " + getElementValue(element, "dc:creator"));
            System.out.println("comments: " + getElementValue(element, "wfw:comment"));
            System.out.println("description: " + getElementValue(element, "description"));
            System.out.println();
            */
            System.out.println(s);
            if (i == 0) {
                newsInBrief = newsInBrief + " News? " + (i + 1) + "! ";
            } else {
                newsInBrief = newsInBrief + "The next news is? News! " + (i + 1) + "! ";
            }
            newsInBrief = newsInBrief + " !\n" + getElementValue(element, "title")
                    + "? Now Describing the news! \n"
                    + getElementValue(element, "description") + " !and? ";
            headLines = headLines + getElementValue(element, "title") + "!";
        } // for
        // return s;
    } // try
    catch (Exception ex) {
        ex.printStackTrace();
    }
    // s = headLines + "! " + newsInBrief;
    s = s + newsInBrief;
    return s;
}

private String getCharacterDataFromElement(Element e) {
    try {
        Node child = e.getFirstChild();
        if (child instanceof CharacterData) {
            CharacterData cd = (CharacterData) child;
            return cd.getData();
        }
    } catch (Exception ex) {
    }
    return "";
} // getCharacterDataFromElement

protected float getFloat(String value) {
    if (value != null && !value.equals("")) {
        return Float.parseFloat(value);
    }
    return 0;
}

protected String getElementValue(Element parent, String label) {
    return getCharacterDataFromElement(
            (Element) parent.getElementsByTagName(label).item(0));
}

/*
public static void main(String[] args) {
    RSSReader reader = RSSReader.getInstance();
    reader.writeNews();
}
*/
}
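The listing above walks the feed's `<item>` elements with the DOM API. The same technique can be shown in a minimal, self-contained sketch that parses the XML from an in-memory string instead of the network (the class name `RssTitleReader` and the sample feed are illustrative, not part of the project):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssTitleReader {

    /** Extracts the text of every <title> nested in an <item> element. */
    public static List<String> itemTitles(String rssXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            List<String> titles = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                titles.add(item.getElementsByTagName("title").item(0).getTextContent());
            }
            return titles;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rss><channel>"
                + "<item><title>First headline</title></item>"
                + "<item><title>Second headline</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed)); // [First headline, Second headline]
    }
}
```

Feeding each extracted title and description to the synthesizer, as the project's `writeNews` does, is then just string concatenation over this list.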
2.Class TaskLauncher1
package com.cvrce.projects.launcher;

import java.awt.*;
//import java.awt.event.*;
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;
import com.sun.speech.freetts.audio.AudioPlayer;
import java.io.*;
import edu.cmu.sphinx.frontend.util.Microphone;
public class TaskLauncher1 extends Frame {

    static int type; // mediaType = 1 for movie, 2 for song, 3 for file
    Frame f;
    TextArea t1;

    public TaskLauncher1() {
        f = new Frame("BBC News");
        //setLayout(new FlowLayout());
        t1 = new TextArea(200, 200);
        //t1.setSize(100, 50);
        f.add(t1);
        f.setSize(1200, 700);
    }

    public Boolean launchTask(String task) {
        System.out.println("Launcher received : " + task);
        // Microphone microphone = new Microphone();
        try {
            if (task.contains("movie")) {
                type = 1;
                // microphone.stopRecording();
                String s = new String("Select your movie! say? 1? for Sixth sense? 2? for Illusionist? 3? for Madagascar? 4? for shrek? and 5? for Impact");
                voice1(s);
                //microphone.startRecording();
            }
            //Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\Low.mp3");
            //Runtime.getRuntime().exec("E:\\Music\\Low.mp3");
            else if (task.contains("song")) {
                type = 2;
                String s = new String("Select your Music? say 1? for Chak de India? 2? for Give me some sun shine? 3? for iss pal? 4? for miss independent and 5? for Kaash ik din ");
                voice1(s);
            }
            else if (task.contains("data file")) {
                type = 3;
                int i = 0;
                String s = new String("Select whose biodata file to read? say 1? for samarpita? 2? for pranita? 3? for snigdha? and 4? for ellora green");
                voice1(s);
                //fileread(i);
            }

            // if user says one
            if (task.contains("one")) {
                if (type == 1) {
                    // play first movie
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Movies\\Sixth_sense.avi");
                }
                if (type == 2) {
                    // play first song
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\ChakDe.mp3");
                }
                if (type == 3)
                    fileread(1);
            }

            // if user says two
            if (task.contains("two")) {
                if (type == 1) {
                    // play second movie
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Movies\\The_Illusionist.avi");
                }
                if (type == 2) {
                    // play second song
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\3idiots04.mp3");
                }
                if (type == 3)
                    fileread(2);
            }

            // if user says three
            if (task.contains("three")) {
                if (type == 1) {
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Movies\\madagascar2.mkv");
                }
                if (type == 2) {
                    // space added between vlc and the path so the command parses
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\Ispal.mp3");
                }
                if (type == 3)
                    fileread(3);
            }

            // if user says four
            if (task.contains("four")) {
                if (type == 1) {
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Movies\\Shrek1.avi");
                }
                if (type == 2) {
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\MissIndependent.mp3");
                }
                if (type == 3)
                    fileread(4);
            }

            // if user says five
            if (task.contains("five")) {
                if (type == 1) {
                    Runtime.getRuntime().exec("D:\\VLC\\vlc D:\\Impact.avi");
                }
                if (type == 2) {
                    Runtime.getRuntime().exec("D:\\VLC\\vlc E:\\Music\\dwnlds\\showbiz03.mp3");
                }
            }
            else if (task.contains("news"))
                readRSS();
            else if (task.contains("snap"))
                // space added between the viewer and the image path so the command parses
                Runtime.getRuntime().exec("D:\\PicasaPhotoViewer D:\\friends.jpg");
            else {
                String s = new String("");
            }
        }
        catch (Exception e) { // catch added: exec and fileread may throw
            e.printStackTrace();
        }
        return false;
    }

    public void listAllVoices() {
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice[] voices = voiceManager.getVoices();
    }

    public void voice1(String s) {
        listAllVoices();
        String voiceName = "kevin16";
        /* The VoiceManager manages all the voices for FreeTTS. */
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice helloVoice = voiceManager.getVoice(voiceName);
        if (helloVoice == null) {
            System.err.println("Cannot find a voice named " + voiceName
                    + ". Please specify a different voice.");
            System.exit(1);
        }
        /* Allocate the resources for the voice. */
        helloVoice.allocate();
        /* Synthesize speech. */
        helloVoice.speak(s);
        helloVoice.deallocate();
    }

    public void fileread(int i) throws Exception {
        String s1 = new String();
        if (i == 1) {
            s1 = "D:/sambiodata.txt";
        }
        if (i == 2) {
            s1 = "D:/prabiodata.txt";
        }
        if (i == 3) {
            s1 = "E:/snicv.txt";
        }
        if (i == 4) {
            s1 = "E:/ellucv.txt";
        }
        FileReader fr = new FileReader(s1);
        BufferedReader br = new BufferedReader(fr);
        String s2;
        while ((s2 = br.readLine()) != null) {
            System.out.println(s2);
            voice1(s2);
        }
        fr.close();
    }

    public void readRSS() {
        RSSReader reader = RSSReader.getInstance();
        String s = reader.writeNews();
        f.setVisible(true);
        t1.setText(s);
        // speak the news
        voice1(s);
    }
}
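The long if/else chain in `launchTask` hard-codes one media path per spoken number. The same mapping can be expressed as a lookup table, which makes adding a sixth song a one-line change. A sketch (the paths mirror the hard-coded ones above; only the keyword-to-command lookup is shown, not the actual `Runtime.exec` call):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CommandTable {

    // Spoken word -> media path, mirroring the hard-coded song paths in TaskLauncher1.
    private static final Map<String, String> SONGS = new LinkedHashMap<>();
    static {
        SONGS.put("one",   "E:\\Music\\ChakDe.mp3");
        SONGS.put("two",   "E:\\Music\\3idiots04.mp3");
        SONGS.put("three", "E:\\Music\\Ispal.mp3");
        SONGS.put("four",  "E:\\Music\\MissIndependent.mp3");
    }

    /** Returns the full VLC command line for a recognized word, or null if unknown. */
    public static String songCommand(String word) {
        String path = SONGS.get(word);
        return path == null ? null : "D:\\VLC\\vlc " + path;
    }

    public static void main(String[] args) {
        System.out.println(songCommand("two")); // D:\VLC\vlc E:\Music\3idiots04.mp3
    }
}
```

The returned string would then be handed to `Runtime.getRuntime().exec(...)` exactly as in `launchTask`; a null result means the recognized word had no mapping.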
3.Class VoiceResponseSystem
/*
 * Copyright 1999-2004 Carnegie Mellon University.
 * Portions Copyright 2004 Sun Microsystems, Inc.
 * Portions Copyright 2004 Mitsubishi Electric Research Laboratories.
 * All Rights Reserved. Use is subject to license terms.
 *
 * See the file "license.terms" for information on usage and
 * redistribution of this file, and for a DISCLAIMER OF ALL
 * WARRANTIES.
 */
package com.cvrce.projects.speech;

import com.cvrce.projects.launcher.TaskLauncher1;
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;
import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;
/**
 * A program showing a simple speech application built using Sphinx-4. This application uses the
 * Sphinx-4 endpointer, which automatically segments incoming audio into utterances and silences.
 */
public class VoiceResponseSystem {

    public void listAllVoices() {
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice[] voices = voiceManager.getVoices();
    }

    public void voice1(String s) {
        listAllVoices();
        String voiceName = "kevin16";
        System.out.println();
        //System.out.println("Using voice: " + voiceName);
        /* The VoiceManager manages all the voices for FreeTTS. */
        VoiceManager voiceManager = VoiceManager.getInstance();
        Voice helloVoice = voiceManager.getVoice(voiceName);
        if (helloVoice == null) {
            System.err.println("Cannot find a voice named " + voiceName
                    + ". Please specify a different voice.");
            System.exit(1);
        }
        /* Allocate the resources for the voice. */
        helloVoice.allocate();
        /* Synthesize speech. */
        helloVoice.speak(s);
        helloVoice.deallocate();
    }

    public static void main(String[] args) {
        String s1 = new String("Hello and welcome to Voice response system?! select your option? "
                + " say movie? to watch a movie? song? to listen a song?! news? to listen news? "
                + "Data file? to listen the contents of biodata file? and? say snap? to view a picture?");
        VoiceResponseSystem v1 = new VoiceResponseSystem();
        //v1.voice1(s1);
        ConfigurationManager cm;
        if (args.length > 0) {
            cm = new ConfigurationManager(args[0]);
        } else {
            cm = new ConfigurationManager(VoiceResponseSystem.class.getResource("vrs.config.xml"));
        }
        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // start the microphone, or exit the program if this is not possible
        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (!microphone.startRecording()) {
            System.out.println("Cannot start microphone.");
            recognizer.deallocate();
            System.exit(1);
        }

        System.out.println("Ask: Song/News/Data File/Movie/Snap");

        // loop the recognition until the program exits
        while (true) {
            System.out.println("Start speaking.\n");
            Result result = recognizer.recognize();
            if (result != null) {
                String resultText = result.getBestFinalResultNoFiller();
                System.out.println("You said: " + resultText + '\n');
                TaskLauncher1 tl = new TaskLauncher1();
                tl.launchTask(resultText);
                // microphone.stopRecording();
                // recognizer.deallocate();
            }
        }
    }
}
4.Grammar File
#JSGF V1.0;

/**
 * JSGF Grammar for Hello World example
 */
grammar hello;

public <greet> = ( Song | News | Data File | Movie | One | Two | Three | Four | Five | Snap );
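Because the grammar is a flat list of alternatives, the set of utterances the recognizer can ever return is small and fixed. A tiny sketch of the equivalent check in plain Java (the word set is copied from the `<greet>` rule; the multi-word alternative `Data File` is matched case-insensitively as a whole phrase):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class GrammarWords {

    // The alternatives from the <greet> rule of grammar "hello".
    private static final Set<String> COMMANDS = new HashSet<>(Arrays.asList(
            "song", "news", "data file", "movie",
            "one", "two", "three", "four", "five", "snap"));

    /** True if the utterance is one of the phrases the grammar can produce. */
    public static boolean inGrammar(String utterance) {
        return COMMANDS.contains(utterance.toLowerCase().trim());
    }

    public static void main(String[] args) {
        System.out.println(inGrammar("Movie"));     // true
        System.out.println(inGrammar("data file")); // true
        System.out.println(inGrammar("weather"));   // false
    }
}
```

Restricting the search space this way is what makes small command-and-control grammars much more accurate than free-form dictation: the recognizer only has to choose among ten phrases.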
1.Selecting song:
Example 1: After selecting song, it asks for the other options under this action, such as saying one for the song Chak de India, two for Give me some sunshine, etc.
Example 2:
2.Selecting Photo:
After selecting the option snap, it opens the picture friends.jpg, as shown below.
3.Selecting movie:
After selecting movie, it asks for the other options under this action, such as saying one for the movie The Sixth Sense, two for The Illusionist, etc.
Example 2:
4.Selecting News:
After selecting this option, it connects to the BBC News RSS feed, i.e., http://feeds.bbci.co.uk/news/world/asia/rss.xml
Chapter 9 DISCUSSION
The modular framework of Sphinx-4 has permitted us to do some things very easily that have traditionally been difficult. The modular nature of Sphinx-4 also gives it the ability to use modules whose implementations range from general to specific applications of an algorithm. For example, we were able to improve the runtime speed for the RM1 regression test by almost two orders of magnitude merely by plugging in a new Linguist and leaving the rest of the system the same. Furthermore, the modularity of Sphinx-4 allows it to support a wide variety of tasks: the various SearchManager implementations allow Sphinx-4 to efficiently support tasks that range from small-vocabulary to large-vocabulary recognition, and the Linguist implementations allow it to support traditional CFG-based command-and-control applications in addition to applications that use stochastic language models.

The modular nature of Sphinx-4 was enabled primarily by the use of the Java programming language. In particular, the ability of the Java platform to load code at run time permits simple support for the pluggable framework, and the Java programming language construct of interfaces permits separation of the framework design from the implementation. The Java platform also provides Sphinx-4 with a number of other advantages:
- Sphinx-4 can run on a variety of platforms without the need for recompilation.
- The rich set of platform APIs greatly reduces coding time.
- Built-in support for multithreading makes it simple to experiment with distributing decoding tasks across multiple threads.
- Automatic garbage collection helps developers to concentrate on algorithm development instead of memory leaks.

On the downside, the Java platform can have issues with memory footprint. Also related to memory, some speech engines will access the platform memory directly in order to optimize memory throughput during decoding.
Direct access to the platform memory model is not permitted with the Java programming language. A common misconception people have regarding the Java programming language is that it is too slow. When developing Sphinx-4, we carefully instrumented the code to measure various aspects of the system, comparing the results to its predecessor.
Table I provides a summary showing that Sphinx-4 performs well (for both WER and RT, a lower number indicates better performance). An interesting result helps to demonstrate the strength of the pluggable and modular design of Sphinx-4: we were able to plug in different implementations of the Linguist and SearchManager that were optimized for particular tasks, allowing Sphinx-4 to perform much better. Another interesting aspect of the performance study is that raw computing speed is not our biggest concern when it comes to RT performance. For the 2-CPU results, we used a Scorer that equally divided the scoring task across the available CPUs. While the increase in speed is noticeable, it is not as dramatic as we expected. Further analysis helped us determine that only about 30 percent of the CPU time is spent doing the actual scoring of the acoustic model states. The remaining 70 percent is spent on non-scoring activity, such as growing and pruning the ActiveList. Our results also show that the Java platform's garbage collection mechanism accounts for only 2-3 percent of the overall CPU usage.
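The 2-CPU Scorer mentioned above splits the per-frame scoring work equally across threads. The idea can be sketched with `ExecutorService` (a minimal illustration: the "scores" are plain doubles and the equal-split policy is assumed; Sphinx-4's real Scorer operates on acoustic-model states, not on an array of numbers):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScorer {

    /** Sums "scores" by dividing the array into equal chunks, one per worker thread. */
    public static double score(double[] values, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            int chunk = (values.length + nThreads - 1) / nThreads;
            List<Future<Double>> parts = new ArrayList<>();
            for (int t = 0; t < nThreads; t++) {
                final int from = Math.min(values.length, t * chunk);
                final int to = Math.min(values.length, from + chunk);
                parts.add(pool.submit(() -> {
                    double s = 0;
                    for (int i = from; i < to; i++) s += values[i];
                    return s;
                }));
            }
            double total = 0;
            for (Future<Double> p : parts) total += p.get(); // combine partial results
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(score(new double[]{1, 2, 3, 4}, 2)); // 10.0
    }
}
```

As the text notes, such a split only speeds up the ~30 percent of CPU time spent scoring; the sequential ActiveList work is untouched, which is why the overall gain is smaller than one might expect (Amdahl's law in action).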
TEST                   WER (%)   RT
TI46 (11 words)        0.168     0.02
TIDIGITS (11 words)    0.549     0.05
AN4 (79 words)         1.192     0.20
RM1 (1,000 words)      2.739     0.40
WSJ5K (5,000 words)    7.174     0.96

(Table I: Sphinx-4 performance. Word error rate (WER) is given in percent. Real-time (RT) speed is the ratio of the time taken to decode an utterance to the duration of the utterance.)
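The real-time factor is conventionally computed as decode time divided by utterance duration, so a value below 1 means the recognizer keeps up with live audio. A one-line illustration (the example durations are made up):

```java
public class RealTimeFactor {

    /** RT = time taken to decode the utterance / duration of the utterance. */
    public static double rt(double decodeSeconds, double utteranceSeconds) {
        return decodeSeconds / utteranceSeconds;
    }

    public static void main(String[] args) {
        // A 10-second utterance decoded in 2 seconds: RT = 0.2,
        // i.e. five times faster than real time.
        System.out.println(rt(2.0, 10.0)); // 0.2
    }
}
```

By this measure every row in Table I is faster than real time, with the margin shrinking as the vocabulary grows.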
Results:
The test cases mentioned in the previous section have been found to produce correct results, provided the voice is recognized correctly. However, voice recognition is not 100 percent accurate and may sometimes lead to frustrating results.
Known Bugs/Defects
Since the project is based on voice recognition, its working accuracy is not very high. Sometimes we may speak as loudly and with as clear a pronunciation as possible, and yet the program might misunderstand what is spoken. This cannot be attributed to a bug in the project, but it is certainly a defect, and it arises from a large number of factors. Some of these factors are noise interference from the environment and the difference between the accent of the user and the accent the program is trained to understand.
Workaround:
While no perfect solution for this can be implemented, we can use a workaround: train the program to understand the accent of a specific user, which will in turn result in higher accuracy.
Chapter 10 CONCLUSION
ADVANTAGES:
- Able to write text through both keyboard and voice input.
- Voice recognition of different notepad commands such as open, save and clear.
- Opens different Windows software based on voice input.
- Reduces the time spent writing text.
- Provides significant help for people with disabilities.
- Lower operational costs.
DISADVANTAGES:
- Low accuracy.
- Performs poorly in noisy environments.
After careful development of the Sphinx-4 framework, we created a number of differing implementations for each module in the framework. For example, the Front End implementations support MFCC, PLP, and LPC feature extraction; the Linguist implementations support a variety of language models, including CFGs, FSTs, and N-Grams; and the Decoder supports a variety of Search Manager implementations. Using the Configuration Manager, the various
implementations of the modules can be combined in various ways, supporting our claim that we have developed a flexible, pluggable framework. Furthermore, the framework performs well in both speed and accuracy when compared to its predecessors. The Sphinx-4 framework is already proving itself to be research-ready, easily supporting various work as well as specialized Linguists. We view this as only the very beginning, however, and expect Sphinx-4 to support future areas of core speech recognition research. Finally, the source code of Sphinx-4 is freely available. The license permits others to do academic and commercial research and to develop products without requiring any licensing fees. More information is available at http://cmusphinx.sourceforge.net/sphinx4.

This thesis/project work on a voice response system started with a brief introduction to the technology and its applications in different sectors. The project part of the report was based on software development for the voice response system. In the later stages, we discussed different tools for bringing that idea into practical work. After its development, the software was finally tested, the results were discussed, and a few deficiencies were brought to light. After the testing work, the advantages of the software were described, and suggestions for further enhancement and improvement were discussed.
Future Enhancements
This work can be taken into more detail, and more work can be done on the project in order to bring in modifications and additional features. The current software doesn't support a large vocabulary; work will be done to accumulate a larger number of samples and increase the efficiency of the software. The current version of the software supports only a few areas, but more areas can be covered, and effort will be made in this regard.
Chapter 11 BIBLIOGRAPHY
[1] S. Young, "The HTK hidden Markov model toolkit: Design and philosophy," Cambridge University Engineering Department, UK, Tech. Rep. CUED/F-INFENG/TR152, Sept. 1994.
[2] N. Deshmukh, A. Ganapathiraju, J. Hamaker, J. Picone, and M. Ordowski, "A public domain speech-to-text system," in Proceedings of the 6th European Conference on Speech Communication and Technology, vol. 5, Budapest, Hungary, Sept. 1999, pp. 2127-2130.
[3] X. X. Li, Y. Zhao, X. Pi, L. H. Liang, and A. V. Nefian, "Audio-visual continuous speech recognition using a coupled hidden Markov model," in Proceedings of the 7th International Conference on Spoken Language Processing, Denver, CO, Sept. 2002, pp. 213-216.
[4] K. F. Lee, H. W. Hon, and R. Reddy, "An overview of the SPHINX speech recognition system," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 1, pp. 35-45, Jan. 1990.
[5] X. Huang, F. Alleva, H. W. Hon, M. Y. Hwang, and R. Rosenfeld, "The SPHINX-II speech recognition system: an overview," Computer Speech and Language, vol. 7, no. 2, pp. 137-148, 1993.
[6] M. K. Ravishankar, "Efficient algorithms for speech recognition," PhD Thesis (CMU Technical Report CS-96-143), Carnegie Mellon University, Pittsburgh, PA, 1996.
[7] P. Lamere, P. Kwok, W. Walker, E. Gouvea, R. Singh, B. Raj, and P. Wolf, "Design of the CMU Sphinx-4 decoder," in Proceedings of the 8th European Conference on Speech Communication and Technology, Geneve, Switzerland, Sept. 2003, pp. 1181-1184.
[8] J. K. Baker, "The Dragon system - an overview," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 23, no. 1, Feb. 1975, pp. 24-29.
[9] B. T. Lowerre, "The Harpy speech recognition system," Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, PA, 1976.
[10] J. K. Baker, "Stochastic modeling for automatic speech understanding," in Speech Recognition, R. Reddy, Ed. New York: Academic Press, 1975, pp. 521-542.
[11] P. Placeway, S. Chen, M. Eskenazi, U. Jain, V. Parikh, B. Raj, M. Ravishankar, R. Rosenfeld, K. Seymore, M. Siegler, R. Stern, and E. Thayer, "The 1996 HUB-4 Sphinx-3 system," in Proceedings of the DARPA Speech Recognition Workshop. Chantilly, VA: DARPA, Feb. 1997. [Online]. Available: http://www.nist.gov/speech/publications/darpa97/pdf/placewa1.pdf
[12] M. Ravishankar, "Some results on search complexity vs accuracy," in Proceedings of the DARPA Speech Recognition Workshop. Chantilly, VA: DARPA, Feb. 1997. [Online]. Available: http://www.nist.gov/speech/publications/darpa97/pdf/ravisha1.pdf
[13] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, 1998.
[14] X. Huang, A. Acero, F. Alleva, M. Hwang, L. Jiang, and M. Mahajan, "From SPHINX-II to Whisper: Making speech recognition usable," in Automatic Speech and Speaker Recognition, Advanced Topics, C. Lee, F. Soong, and K. Paliwal, Eds. Norwell, MA: Kluwer Academic Publishers, 1996.
[15] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, Aug. 1980.
[16] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, 1990.
[17] NIST. Speech recognition scoring package (score). [Online]. Available: http://www.nist.gov/speech/tools
[18] G. D. Forney, "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268-278, 1973.
[19] P. Kenny, R. Hollan, V. Gupta, M. Lenning, P. Mermelstein, and D. O'Shaughnessy, "A*-admissible heuristics for rapid lexical access," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 1, pp. 49-59, Jan. 1993.
[20] Java Speech API grammar format (JSGF). [Online]. Available: http://java.sun.com/products/java-media/speech/forDevelopers/JSGF/
[21] M. Mohri, "Finite-state transducers in language and speech processing," Computational Linguistics, vol. 23, no. 2, pp. 269-311, 1997.
[22] P. Clarkson and R. Rosenfeld, "Statistical language modeling using the CMU-Cambridge toolkit," in Proceedings of the 5th European Conference on Speech Communication and Technology, Rhodes, Greece, Sept. 1997.
[23] Carnegie Mellon University. CMU pronouncing dictionary. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[24] S. J. Young, N. H. Russell, and J. H. S. Russell, "Token passing: A simple conceptual model for connected speech recognition systems," Cambridge University Engineering Department, UK, Tech. Rep. CUED/F-INFENG/TR38, 1989.
[25] R. Singh, M. Warmuth, B. Raj, and P. Lamere, "Classification with free energy at raised temperatures," in Proceedings of the 8th European Conference on Speech Communication and Technology, Geneve, Switzerland, Sept. 2003, pp. 1773-1776.
[26] P. Kwok, "A technique for the integration of multiple parallel feature streams in the Sphinx-4 speech recognition system," Master's Thesis (Sun Labs TR-2003-0341), Harvard University, Cambridge, MA, June 2003.
[27] P. Price, W. M. Fisher, J. Bernstein, and D. S. Pallett, "The DARPA 1000-word resource management database for continuous speech recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 1. IEEE, 1988, pp. 651-654.
[28] G. R. Doddington and T. B. Schalk, "Speech recognition: Turning theory to practice," IEEE Spectrum, vol. 18, no. 9, pp. 26-32, Sept. 1981.
[29] R. G. Leonard and G. R. Doddington, "A database for speaker-independent digit recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, vol. 3. IEEE, 1984, p. 42.11.
[30] J. Garofolo, E. Voorhees, C. Auzanne, V. Stanford, and B. Lund, "Design and preparation of the 1996 HUB-4 broadcast news benchmark test corpora," in Proceedings of the DARPA Speech Recognition Workshop. Chantilly, Virginia: Morgan Kaufmann, Feb. 1997, pp. 15-21.
[31] (2003, Mar.) Sphinx-4 trainer design. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/TrainerDesign
[32] J. R. Glass, "A probabilistic framework for segment-based speech recognition," Computer Speech and Language, vol. 17, no. 2, pp. 137-152, Apr. 2003.