
Speech Technologies and VoiceXML

Guided by Mr. R. P. Ojha

Presentation Agenda

• Voice technologies background
  - ASR/TTS
• Voice browsing with VoiceXML
• VoiceXML architecture
• VoiceXML programming
• Future of VoiceXML
• Summary

Voice Technologies

• In the mid- to late 1990s, personal computers became powerful enough to support automatic speech recognition (ASR).
• The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).

Speech Recognition

Speech Synthesis

Pervasive Computing Model

• E-business has shifted from a client-server model to a web-centric model.
• Once connected to the Internet, one can get any information one wants, but people want more convenient ways to connect to the Internet.
• Lou Gerstner, CEO of IBM: the pervasive computing model is a billion people interacting with a million e-businesses through a trillion interconnected devices.

Voice Browsing

• VoiceXML instead of HTML
• A voice browser instead of an ordinary web browser
• Phone instead of PC

VoiceXML Key Design Issues

• Speech input: speech recognition and DTMF
• Speech output: pre-recorded audio and synthesized speech
• Internet: XML, IP, HTTP, SSL, JavaScript
• Telephony: call transfer, data passing

W3C Voice Browser Working Group

• Founded May 1999
• 60 company members
• Mission: a standards group that prepares and reviews markup languages to enable Internet-based speech applications
• http://www.w3.org/Voice

VoiceXML Forum

• Industry group to promote VoiceXML
• 550+ member companies
• Submitted VoiceXML 1.0 to the W3C in May 2000
• http://www.voicexml.org

VoiceXML v1.0 (May 2000)
• VoiceXML Forum specification submitted to the W3C

VoiceXML v2.0
• W3C Voice Browser Working Group
• 50+ members collaborating
• Addressed 400+ change requests

VoiceXML Overview

• A language for specifying voice dialogs.
• Voice dialogs use audio prompts and text-to-speech (TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input.
• Main input/output device (initially) is the phone.
• Leverages the Internet for application development and delivery.
• Standard language enables portability.

[Figure: VoiceXML dialog]

VoiceXML Platform Architecture

VoiceXML Platform Architecture-1

• Telephone and telephone network: connects the caller's telephone with the telephony server
• VoiceXML gateway

VoiceXML Platform Architecture-2

• Voice browser
• Audio input: speech recognition (ASR), touch-tone (DTMF), audio recording
• Audio output: audio playback, speech synthesis (TTS)
• Interface, call controls

• VoiceXML documents
  - Dialog and flow control
  - Client-side scripting (ECMAScript)
  - Speech recognition grammar
  - Speech synthesis pronunciation control
• Document servers (web server)
  - Feed static VoiceXML documents or audio files
• Application servers
  - Generate VoiceXML documents dynamically
  - Server-side application logic
  - Connect to a database, or database interface

VoiceXML and JSP - Example: weather.jsp

JSP source:
    <% user.storePreference( try) %>
    <form> <block> <%= weather.getTemp() %> </block> </form>

Generated VoiceXML:
    <form> <block> 25 </block> </form>

[Figure labels: VoiceXML browser, DB, Web server + Servlet/JSP engine, Voice Gateway]

Implementations of VoiceXML Gateways

• In Taiwan:
  - Yes Mobile
  - Telecom Laboratories (Chunghwa)
  - eWings Technologies, Inc.
• Free
  - IBM VoiceServer SDK
• Open source
  - CMU: OpenVXI

[DEMO] A Simple VoiceXML Application

• A simple VoiceXML application that introduces the Department of Computer Science.
• Experience shows that building a corresponding HTML version first is helpful.

Document

• A VoiceXML document defines one or more dialogs
• The user is always in one dialog at any time
• Each dialog specifies the next dialog to transition to using a URL

[Figure: doc1.vxml contains Dialog 1 (transition: #dialog2) and Dialog 2 (transition: http://xyz.com/doc2.vxml)]
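
As a minimal sketch of the document and dialog model described on this slide, a hypothetical doc1.vxml could look like the following. The dialog ids (dialog1, dialog2) and the prompt wording are assumptions made for illustration; only the target URL comes from the slide.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Dialog 1: the first dialog in document order is entered first -->
  <form id="dialog1">
    <block>
      <prompt>Welcome.</prompt>
      <!-- Transition to another dialog in the same document (URI fragment) -->
      <goto next="#dialog2"/>
    </block>
  </form>

  <!-- Dialog 2 -->
  <form id="dialog2">
    <block>
      <prompt>Moving on to the next document.</prompt>
      <!-- Transition to a dialog in a different document (full URL) -->
      <goto next="http://xyz.com/doc2.vxml"/>
    </block>
  </form>

</vxml>
```

If a dialog does not name a successor, execution of the document ends, which is why each dialog here finishes with an explicit transition.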

Dialog

• A dialog describes an interaction between a user and the system
• Two kinds of dialogs: form and menu

VoiceXML Document Structure.

Form

    <form>
      <field name="travellers">
        <grammar mode="voice" src="./number.grxml"/>
        <prompt>How many are travelling?</prompt>
        <filled>
          <submit next="http://travel.com/order/"/>
        </filled>
      </field>
    </form>

[Slide annotations: Form, Grammar, Field, input/output, eval]
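
The form above refers to an external grammar file, ./number.grxml, which the slides do not show. As a rough sketch only, such a file could be written in the W3C Speech Recognition Grammar Specification (SRGS) XML form as below. The rule name and the list of number words are assumptions made for illustration, and the sketch omits semantic tags, so a real deployment would still need to map each spoken word to a numeric value.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical number.grxml: accepts one spoken number word -->
<grammar version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US"
         mode="voice"
         root="number">

  <!-- The root rule is the one matched when the grammar is referenced -->
  <rule id="number" scope="public">
    <one-of>
      <item>one</item>
      <item>two</item>
      <item>three</item>
      <item>four</item>
      <item>five</item>
    </one-of>
  </rule>

</grammar>
```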

Menu

    <menu id="commands">
      <prompt>What service would you like?</prompt>
      <choice next="/cars">Car hire</choice>
      <choice next="/news">Today's news</choice>
      <choice next="/hotels">Hotel reservations</choice>
    </menu>

[Slide annotations: menu, form, user, URL]

Submit

• Typically used to send results from the client to the server
• Syntax: <submit next="URI" namelist="var1 var2 ..."/>
• namelist: the fields (variables) to be submitted

Submit, Example

    <form>
      <field name="city">
        <prompt>Where do you want to go to?</prompt>
        <grammar mode="voice" src="./cities.grxml"/>
      </field>
      <field name="travellers">
        <prompt>How many are travelling to <value expr="city"/>?</prompt>
        <grammar mode="voice" src="./number.grxml"/>
      </field>
      <filled>
        Thank you. Your order is now being processed.
        <submit next="http://travel.com/order" namelist="city travellers"/>
      </filled>
    </form>

Variables

• Variables can be manipulated and referenced:
    Declaring (field item variable):  <field name="user2">
    Assigning:    <assign name="user1" expr="'peter'"/>
    Clearing:     <clear namelist="user1 user2"/>
    Referencing:  How many are travelling to <value expr="city"/>?

Variable Scope

• Scopes, from outermost to innermost: session, application, document, dialog, (anonymous)
• Session variables are read-only variables provided by the interpreter context
• The search for a variable name starts in the innermost scope and moves outward
• Anonymous scope: defined by the element containing executable content (<block>, <filled> or an event handler)
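
To make the scope levels concrete, here is a small sketch of a document that declares variables at three levels and reads two of them back through their scope prefixes. All variable names, the form id, and the prompt text are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Document scope: visible to every dialog in this document -->
  <var name="greeting" expr="'Welcome'"/>

  <form id="scopes">
    <!-- Dialog scope: visible only inside this form -->
    <var name="attempts" expr="0"/>

    <block>
      <!-- Anonymous scope: exists only while this block executes -->
      <var name="note" expr="'first visit'"/>

      <!-- An unqualified name is resolved from the innermost scope outward -->
      <assign name="attempts" expr="attempts + 1"/>

      <prompt>
        <!-- A scope prefix selects a specific scope explicitly -->
        <value expr="document.greeting"/>.
        This is attempt <value expr="dialog.attempts"/>,
        <value expr="note"/>.
      </prompt>
    </block>
  </form>

</vxml>
```

Qualifying a name with its scope (document.greeting, dialog.attempts) is optional here, but it avoids surprises when an inner scope declares a variable with the same name.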

Events

• Events are used to signal unexpected situations
• Events are caught by a <catch> event handler

    <catch event="com.acme.mailreader">...</catch>
    <catch event="nomatch noinput">...</catch>

• Shortcut: <nomatch> is equivalent to <catch event="nomatch">
• Other shortcuts: <noinput>, <error>

Events, Example

    <field name="city">
      <prompt>Where do you want to go to?</prompt>
      <grammar mode="voice" src="./cities.grxml"/>
      <nomatch>Please say the city you want to fly to.</nomatch>
    </field>
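
Event handlers can also escalate and raise application-defined events. The following sketch assumes a hypothetical event name, com.example.toomanyerrors, and hypothetical prompt wording; it combines the count attribute on handlers with <throw> and a document-level <catch>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Document-level handler: applies to every dialog in this document -->
  <catch event="com.example.toomanyerrors">
    <prompt>Sorry, we are having trouble understanding you. Goodbye.</prompt>
    <exit/>
  </catch>

  <form id="ask_city">
    <field name="city">
      <prompt>Where do you want to go to?</prompt>
      <grammar mode="voice" src="./cities.grxml"/>

      <!-- Caller said nothing: say so, then play the field prompt again -->
      <noinput>
        <prompt>I did not hear anything.</prompt>
        <reprompt/>
      </noinput>

      <!-- First and second recognition failures -->
      <nomatch count="1">
        <prompt>Please say the city you want to fly to.</prompt>
      </nomatch>

      <!-- Third failure onward: raise the application-defined event -->
      <nomatch count="3">
        <throw event="com.example.toomanyerrors"/>
      </nomatch>
    </field>
  </form>

</vxml>
```

The interpreter picks the handler with the highest count that does not exceed the number of times the event has occurred, so the count="1" handler also covers the second failure.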

Multimodal Web Browsing

• XHTML + VoiceXML
• SALT

[DEMO] Multimodal Browsing

Future of the Voice Web and VoiceXML

[Figure: timeline of voice markup standards]
• Sun/SpeechWorks (1999): JSML, JSGF
• VoiceXML Forum (2000): VoiceXML 1.0
• W3C (2003, in CR): VoiceXML 2.0
• W3C: speech synthesis (SSML), speech recognition grammar, speech semantics, NLP, pronunciation lexicon, call control (several marked [early])
• VoiceXML 3?
• Microsoft-led (2002): SALT (Speech Application Language Tags)
• Voice browser interoperation

Conclusion

• Speech is the most natural way for humans to communicate, so it will become an important mode of human-computer interaction (HCI).
• VoiceXML has revolutionized the development and deployment of speech recognition and telephony applications.

Q&A

Backup

History of VoiceXML
Source: VoiceXML Forum (http://www.voicexml.org)

Show: VoiceXML in Daily Life

Classification of Voice Application

• Basic interactive voice response (IVR)
    Computer: For stock quotes, press 1. For trading, press 2.
    Human: (presses DTMF 1)
• Basic speech ASR
    C: Say the stock name for a price quote.
    H: Lucent Technologies
• Advanced speech ASR
    C: Stock Services, how may I help you?
    H: Uh, what's Lucent trading at?
• Near-natural language ASR
    C: How may I help you?
    H: Um, yeah, I'd like to get the current price of Lucent Technologies.
    C: Lucent is up two at sixty eight and a half.
    H: OK. I want to buy one hundred shares at market price.
    C:

Speech Recognition

• Capturing speech (analog) signals
• Digitizing the sound waves and converting them to basic language units, or phonemes
• Constructing words from phonemes, and contextually analyzing the words to ensure correct spelling for words that sound alike (such as "write" and "right")

Speech Synthesis

• Speech synthesis, or text-to-speech, is the process of converting text into spoken language:
  - Breaking down the words into phonemes
  - Analyzing the text for special handling, such as numbers and currency amounts
  - Generating the digital audio for playback

VoiceXML Gateway (detail)

Programming VoiceXML

• Writing a VoiceXML application is programming.
• Control constructs are procedural (if-else etc.).
• The VoiceXML platform iterates through a <form> until values for all field items have been collected.

VoiceXML System Components


PBX, telecom boards, VoiceXML server, software utilities

Speech synthesis (TTS), speech recognition (SR), speech grammars, voice biometrics

Call centre

VoiceXML servers serve as integrators of various hardware and software


CT Integration

FIA - Form Interpretation Algorithm

• The FIA has a main loop that repeatedly selects a form item and then visits it.
• The first form item (in document order) whose field item variable is undefined is selected.
• As a result, the user is prompted for each field item in turn.

FIA Example Form

    <form>
      <block>
        <prompt>Where do you want to go to and how many are travelling?</prompt>
      </block>

      <!-- Field item 1 -->
      <field name="city">
        <prompt>Where do you want to go to?</prompt>
        <grammar mode="voice" src="./cities.grxml"/>
      </field>

      <!-- Field item 2 -->
      <field name="travellers">
        <prompt>How many are travelling to your destination?</prompt>
        <grammar mode="voice" src="./number.grxml"/>
      </field>

      <!-- other fields -->
    </form>

if, else and elseif

    <form>
      ...
      <filled>
        <if cond="travellers > 10">
          Sorry, we cannot handle groups larger than 10 persons
          <clear namelist="travellers"/>
        <elseif cond="travellers > 5 &amp;&amp; city == 'London'"/>
          Sorry, we cannot handle groups larger than 5 persons travelling to London
          <clear namelist="city travellers"/>
        <else/>
          <submit next="http://travel.com/order"/>
        </if>
      </filled>
    </form>

JSML - JSpeech Markup Language

• Developed by Sun and SpeechWorks as a markup language for text-to-speech dialogs
• Based on the Java Speech API Markup Language
  http://java.sun.com/products/java-media/speech/
• Text annotation to provide hints to speech synthesizers
  - Aimed at making TTS speech more natural and more understandable
• Feature set:
  - hints to word pronunciation
  - hints to phrasing, emphasis, pitch and speaking rate
  - marker elements: notifications from the speech synthesizer to applications when a marker is reached
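
The kinds of hints listed for JSML can be illustrated with W3C SSML, the speech synthesis markup that VoiceXML 2.0 prompts accept. Note that the sketch below uses SSML elements, not JSML tags, and that the form id and the prompt text are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="synthesis_hints">
    <block>
      <prompt>
        <!-- Emphasis hint on a single word -->
        Your flight to <emphasis>London</emphasis> departs
        <!-- Phrasing hint: insert a half-second pause -->
        <break time="500ms"/>
        <!-- Speaking-rate hint for the time of day -->
        <prosody rate="slow">at nine thirty</prosody>.
        Have a pleasant trip.
      </prompt>
    </block>
  </form>
</vxml>
```

How strongly a synthesizer follows such hints is platform dependent, so the audible result may differ between voice browsers.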

JSGF - JSpeech Grammar Format

• Developed by Sun and SpeechWorks as a syntax for expressing speech grammars
• Based on the Java Speech API Grammar Format
  http://java.sun.com/products/java-media/speech/

Microsoft's SALT

• Speech Application Language Tags
  - Microsoft, Cisco, Intel, Comverse, SpeechWorks, Philips
• A lightweight set of tags designed to be used with HTML and XHTML to enable lightweight telephony applications driven from regular Web documents
• Targeted at supporting multimodal access
