Presentation Agenda
- Voice browsing with VoiceXML
- VoiceXML architecture
- VoiceXML programming
- Future of VoiceXML
- Summary
Voice Technologies
- In the mid- to late 1990s, personal computers became powerful enough to support automatic speech recognition (ASR).
- The two key technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
[Diagram: Speech Recognition, Speech Synthesis]
Voice Browsing
- E-business has moved from a client-server model to a web-centric model.
- Once connected to the Internet, people can get any information they want, but they want a more convenient way to connect.
- Lou Gerstner, CEO of IBM: the pervasive computing model is a billion people interacting with a million e-businesses through a trillion interconnected devices.
- VoiceXML instead of HTML
- A voice browser instead of an ordinary web browser
- A phone instead of a PC
- Speech input: speech recognition and DTMF
- Speech output: pre-recorded audio and synthesized speech
- Internet: XML, IP, HTTP, SSL, JavaScript
- Telephony: call transfer, data passing
- Founded May 1999
- 60 member companies
- Mission: a standards group that prepares and reviews markup languages to enable Internet-based speech applications
- http://www.w3.org/Voice
VoiceXML Forum
- Industry group to promote VoiceXML
- 550+ member companies
- Submitted VoiceXML 1.0 to the W3C in May 2000
- http://www.voicexml.org
VoiceXML v2.0
- W3C Voice Browser Working Group
- 50+ members collaborating
- Addressed 400+ change requests
VoiceXML Overview
- A language for specifying voice dialogs.
- Voice dialogs use audio prompts and text-to-speech (TTS) for output, and touch-tone keys (DTMF) and automatic speech recognition (ASR) for input.
- The main input/output device (initially) is the phone.
- Leverages the Internet for application development and delivery.
- A standard language enables portability.
[Diagram: VoiceXML dialog]
VoiceXML Platform Architecture-1
- Telephone and telephone network: connects the caller's telephone with the telephony server
- VoiceXML gateway
  - Voice browser
  - Audio input: speech recognition (ASR), touch-tone (DTMF), audio recording
  - Audio output: audio playback, speech synthesis (TTS)
  - Interface, call controls
VoiceXML documents
- Dialog and flow control
- Client-side scripting (ECMAScript)
- Speech recognition grammar
- Speech synthesis pronunciation control

Application servers
- Generate VoiceXML documents dynamically
- Server-side application logic
- Connect to a database or a database interface
[Diagram: VoiceXML browser, Voice Gateway, DB]
- Free: IBM Voice Server SDK
- Open source: CMU OpenVXI
DEMO
- A simple VoiceXML application that introduces the Department of Computer Science.
- Experience shows that building a corresponding HTML version first is helpful.
Document
- A VoiceXML document defines one or more dialogs.
- The user is always in exactly one dialog at any time.
- Each dialog specifies the next dialog to transition to, using a URL.

[Diagram: doc1.vxml]
- Dialog 1, transition: #dialog2
- Dialog 2, transition: http://xyz.com/doc2.vxml
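The two transitions in the diagram can be told apart by resolving each target against the current document's URL. The following is a minimal Python sketch, not how any particular voice browser does it. The helper name `resolve_transition` is hypothetical, and the absolute location `http://xyz.com/doc1.vxml` is an assumption, since the slide gives only the file name.

```python
from urllib.parse import urljoin, urldefrag


def resolve_transition(current_doc_url, next_uri):
    """Return (document URL, dialog id or None) for a transition target.

    A bare fragment such as "#dialog2" stays in the current document;
    a full URL names another document, optionally with a dialog id.
    """
    absolute = urljoin(current_doc_url, next_uri)
    doc_url, fragment = urldefrag(absolute)
    return doc_url, (fragment or None)


# Assumed location of doc1.vxml (the slide only shows the file name).
base = "http://xyz.com/doc1.vxml"

print(resolve_transition(base, "#dialog2"))
print(resolve_transition(base, "http://xyz.com/doc2.vxml"))
```

When the resolved document URL equals the current one, the browser only needs to switch dialogs. Otherwise it has to fetch the new document first.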
Dialog
- A dialog describes an interaction between a user and the system.
- Two kinds of dialogs: form and menu
Form
<form>
  <field name="travellers">
    <grammar mode="voice" src="./number.grxml"/>
    <prompt>How many are travelling?</prompt>
    <filled>
      <submit next="http://travel.com/order"/>
    </filled>
  </field>
</form>
[Diagram labels: grammar, field, input, output, eval]
Menu
<menu id="commands">
  <prompt>What service would you like?</prompt>
  <choice next="/cars">Car hire</choice>
  <choice next="/news">Today's news</choice>
  <choice next="/hotels">Hotel reservations</choice>
</menu>
Submit
- Typically used to send results from the client to the server
- Syntax: <submit next="URI" namelist="var1 var2 ..."/>
- namelist: the fields to submit
Submit, Example
<form>
  <field name="dest_city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <field name="travellers">
    <prompt>How many are travelling to <value expr="dest_city"/>?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <filled>
    Thank you. Your order is now being processed.
    <submit next="http://travel.com/order" namelist="dest_city travellers"/>
  </filled>
</form>
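To make the role of namelist concrete, here is a rough Python sketch of what a browser might send for the example above, assuming the default HTTP GET method with URL-encoded variables. The helper `build_submit_url`, the sample values, and the identifier-safe spelling `dest_city` are illustrative assumptions, not part of the slides.

```python
from urllib.parse import urlencode


def build_submit_url(next_uri, namelist, variables):
    """Build the GET request URL for a <submit>.

    Only the variables named in the space-separated namelist are sent;
    everything else stays on the client.
    """
    names = namelist.split()
    pairs = [(name, variables[name]) for name in names]
    return next_uri + "?" + urlencode(pairs)


# Hypothetical values collected by the form; "scratch" is not in the namelist.
fields = {"dest_city": "New York", "travellers": 2, "scratch": "not sent"}

url = build_submit_url("http://travel.com/order", "dest_city travellers", fields)
print(url)  # http://travel.com/order?dest_city=New+York&travellers=2
```

The server-side application can then read these parameters like any other web form submission and respond with the next VoiceXML document.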
Variables
- Declare: <field name="user2">
- Assign: <assign name="user1" expr="'peter'"/>
- Clear: <clear namelist="user1 user2"/>
- Reference: How many are travelling to <value expr="dest_city"/>?
- Scope: defined by the element containing the executable content (<block>, <filled> or an event handler)
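Variable lookup across nested scopes can be pictured as a chain of dictionaries searched from the innermost scope outwards. The sketch below uses Python's `collections.ChainMap` as an analogy only. The scope names follow the VoiceXML 2.0 hierarchy (anonymous, dialog, document, application, session), while the variable names and values are made up for illustration.

```python
from collections import ChainMap

# One dictionary per scope, outermost last. All contents are hypothetical.
session = {"caller_id": "5551234"}
application = {"company": "travel.com"}
document = {}
dialog = {"user1": "peter", "user2": None}
anonymous = {"user1": "mary"}  # declared inside a <block> or <filled>

# Lookup walks from the innermost (anonymous) scope to the session scope.
scope = ChainMap(anonymous, dialog, document, application, session)

print(scope["user1"])          # the anonymous-scope value shadows the dialog one
print(scope["company"])        # found further out, in the application scope
print(scope.parents["user1"])  # after leaving the block, the dialog value is visible again
```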
Events
- Events are used to signal unexpected situations.
- Events are caught by a <catch> event handler:
  <catch event="com.acme.mailreader">...</catch>
  <catch event="nomatch noinput">...</catch>
- Shortcut: <nomatch> is equivalent to <catch event="nomatch">
- Other shortcuts: <noinput>, <error>
Events, Example
<field name="dest_city">
  <prompt>Where do you want to go to?</prompt>
  <grammar mode="voice" src="./cities.grxml"/>
  <nomatch>Please say the city you want to fly to.</nomatch>
</field>
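How a catch handler decides whether it applies to a thrown event can be sketched in a few lines of Python. In VoiceXML 2.0 the event attribute is a space-separated list of names, and a name matches a thrown event exactly or as a dot-separated token prefix. The function name `catch_matches` is hypothetical, and the sketch covers only the matching test, not the choice among several candidate handlers.

```python
def catch_matches(catch_event_attr, thrown_event):
    """Return True if a <catch event="..."> would match the thrown event.

    Matching is token by token on dot-separated names, so "error"
    matches "error.badfetch.http" but "nomatch" does not match
    "nomatching".
    """
    names = catch_event_attr.split()
    if not names:
        # A catch without an event attribute catches every event.
        return True
    thrown_tokens = thrown_event.split(".")
    for name in names:
        tokens = name.split(".")
        if thrown_tokens[:len(tokens)] == tokens:
            return True
    return False


print(catch_matches("nomatch noinput", "noinput"))        # True
print(catch_matches("error", "error.badfetch.http"))      # True
print(catch_matches("com.acme.mailreader", "com.acme"))   # False
```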
[Diagram: evolution of VoiceXML and related standards]
- Sun/SpeechWorks (1999): JSML, JSGF
- VoiceXML Forum (2000): VoiceXML 1.0
- W3C (2003, in CR): VoiceXML 2.0
- W3C: VoiceXML 3?, speech synthesis (SSML), speech recognition grammar, speech semantics, NLP, pronunciation lexicon [early], call control [early]
Conclusion
- Speech is the most natural way for humans to communicate, so it will become an important mode of human-computer interaction (HCI).
- VoiceXML has revolutionized the development and deployment of speech recognition and telephony applications.
Q&A
Backup
History of VoiceXML
Source: VoiceXML Forum (http://www.voicexml.org)
Voice Application
C: Stock Services, how may I help you?
H: Uh, what's Lucent trading at?
Speech Recognition
- Capturing the speech (analog) signals
- Digitizing the sound waves and converting them to basic language units, or phonemes
- Constructing words from the phonemes and contextually analyzing the words to ensure correct spelling for words that sound alike (such as write and right)
Speech Synthesis
Speech synthesis, or text-to-speech, is the process of converting text into spoken language.
- Breaking down the words into phonemes
- Analyzing the text for special handling, such as numbers and currency amounts
- Generating the digital audio for playback
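The "special handling" step for numbers and currency amounts can be illustrated with a toy text normalizer. This is only a sketch under strong assumptions: it handles whole numbers from 0 to 99 and dollar amounts written as `$N`, and it covers neither the phoneme step nor audio generation. The function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer between 0 and 99."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only handles 0 to 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand small numbers and dollar amounts into speakable words."""
    def currency(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # Currency first, so "$45" is not left as "$forty-five".
    text = re.sub(r"\$(\d{1,2})\b", currency, text)
    # Then any remaining stand-alone one- or two-digit numbers.
    return re.sub(r"\b(\d{1,2})\b",
                  lambda m: number_to_words(int(m.group(1))), text)


print(normalize("Tickets for 2 cost $45"))
```

Larger numbers are left untouched by this sketch. A real synthesizer would also have to deal with dates, abbreviations, and ordinals.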
VoiceXML Gateway (detail)
Programming VoiceXML
- Writing a VoiceXML application is programming.
- Control constructs are procedural (if-else, etc.).
- The VoiceXML platform iterates through a <form> until values for all field items have been collected.
[Diagram: call centre with speech synthesis (TTS), speech recognition (SR), speech grammars, voice biometrics]
Form Interpretation Algorithm (FIA)
- The FIA has a main loop that repeatedly selects a form item and then visits it.
- The first form item (in document order) whose field item variable is undefined is selected.
- As a result, the user is prompted for each field item in turn.
Example Form
<form>
  <block>
    <prompt>Where do you want to go to and how many are travelling?</prompt>
  </block>
  <!-- Field item 1 -->
  <field name="dest_city">
    <prompt>Where do you want to go to?</prompt>
    <grammar mode="voice" src="./cities.grxml"/>
  </field>
  <!-- Field item 2 -->
  <field name="travellers">
    <prompt>How many are travelling to your destination?</prompt>
    <grammar mode="voice" src="./number.grxml"/>
  </field>
  <!-- other FIA fields -->
</form>
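The select-and-visit loop of the FIA can be sketched in Python as follows. This is a simplified illustration, not the full algorithm from the specification: it ignores guard conditions, event handling, and <filled> actions. The function `run_fia`, the `recognize` callback, and the `max_visits` safety limit are assumptions made for the example.

```python
def run_fia(field_items, recognize, max_visits=20):
    """Simplified FIA main loop.

    field_items: list of (name, prompt) pairs in document order.
    recognize:   callable(name, prompt) returning the recognized value,
                 or None when the caller said nothing usable.
    Returns the collected values and the prompts played, in order.
    """
    values = {name: None for name, _prompt in field_items}
    prompts_played = []
    for _ in range(max_visits):
        # Select: first item in document order whose variable is undefined.
        selected = next(
            ((name, prompt) for name, prompt in field_items
             if values[name] is None),
            None,
        )
        if selected is None:
            break  # every field item is filled, so the form is done
        # Visit: play the prompt and try to collect a value.
        name, prompt = selected
        prompts_played.append(prompt)
        values[name] = recognize(name, prompt)
    return values, prompts_played


items = [
    ("dest_city", "Where do you want to go to?"),
    ("travellers", "How many are travelling to your destination?"),
]
# Scripted caller: silent once, then answers both questions.
answers = iter([None, "Paris", "3"])
values, prompts = run_fia(items, lambda name, prompt: next(answers))
print(values)
print(prompts)
```

Because an unfilled field is selected again on the next pass, the first prompt is repeated after the silent turn, which is why the caller ends up being asked for each field in turn until it has a value.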
- Developed by Sun and SpeechWorks as a markup language for text-to-speech dialogs
- Based on the Java Speech API Markup Language (JSML): http://java.sun.com/products/java-media/speech/
- Text annotation that provides hints to speech synthesizers
- Aimed at making TTS speech more natural and more understandable
- Feature set:
  - hints for word pronunciation
  - hints for phrasing, emphasis, pitch and speaking rate
  - marker elements: notifications from the speech synthesizer to the application when a marker is reached
- Developed by Sun and SpeechWorks as a syntax for expressing speech grammars
- Based on the Java Speech API Grammar Format (JSGF): http://java.sun.com/products/java-media/speech/
Microsoft's SALT
- A lightweight set of tags designed to be used with HTML and XHTML to enable lightweight telephony applications driven from regular Web documents
- Targeted at supporting multimodal access