
Voice Browser and Multimodal Interaction in 2009

Paolo Baggia
Director of International Standards

March 6th, 2009

Google TechTalk



Overview

- A Bit of History
- W3C Speech Interface Framework Today
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions
- The Near Future
Company Profile

- Privately held company (fully owned by Telecom Italia), founded in 2001 as a
  spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and
  expertise in voice processing.
- Global company, leader in Europe and South America for award-winning, high
  quality voice technologies (synthesis, recognition, authentication and
  identification), available in 26 languages and 62 voices.
- Multilingual, proprietary technologies protected by over 100 patents worldwide.
- Financially robust: break-even reached in 2004, revenues and earnings growing
  year on year.
- Growth-plan investment approved for the evolution of products and services.
- Headquarters in Torino; offices in New York; local representative sales
  offices in Rome, Madrid, Paris, London, Munich.
- Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.
International Awards

- "2008 Frost & Sullivan European Telematics and Infotainment Emerging Company
  of the Year" Award
- Winner of "Market Leader - Best Speech Engine", Speech Industry Award 2007
  and 2008
- Loquendo MRCP Server: winner of the 2008 IP Contact Center Technology
  Pioneer Award
- "Best Innovation in Automotive Speech Synthesis" Prize, AVIOS-SpeechTEK West 2007
- "Best Innovation in Expressive Speech Synthesis" Prize, AVIOS-SpeechTEK West 2006
- "Best Innovation in Multi-Lingual Speech Synthesis" Prize, AVIOS-SpeechTEK West 2005


A Bit of History



Standard Bodies

Two main standard bodies:

- W3C – World Wide Web Consortium
  Founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full
  potential. Staff based at MIT (USA), ERCIM (France), and Keio Univ. (Japan).
  About 400 members all over the world; 50 Working, Interest and Coordination
  Groups. W3C is where the framework of today's Web is developed (HTML, CSS,
  XML, DOM, SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization,
  Web Accessibility, Device Independence).
- IETF – Internet Engineering Task Force
  Founded in 1986; since the early 1990s operating under the Internet Society.
  About 1300 members. Home of HTTP, SIP, RTP and many other protocols. The
  Media Resource Control Protocol (MRCP) is very relevant for speech platforms.

Two industrial forums:

- VoiceXML Forum (www.voicexml.org)
  Inventors of VoiceXML 1.0, then submitted to W3C for standardization. Its
  current goal is to promote, disseminate and support VoiceXML and related
  standards.
- SALT Forum (www.saltforum.org)
  Supported by Microsoft to define a lightweight markup for telephony and
  multimodal applications.

Other relevant bodies: 3GPP, OMA, ETSI, NIST


The (r)evolution of VoiceXML
1998 - 2004

[Timeline figure, reconstructed:]

- 1998: W3C Voice Browser Workshop
- 1999: VoiceXML Forum born, by AT&T, IBM, Lucent, Motorola; W3C charters the
  Voice Browser WG
- 2000: VoiceXML 1.0 released
- 2002: SALT Forum born, by Cisco, Comverse, Intel, Microsoft, Philips,
  SpeechWorks; W3C charters the Multimodal Interaction WG
- 2004: VoiceXML 2.0, SRGS 1.0 and SSML 1.0 become W3C Recs
- 2007: VoiceXML 2.1 and SISR 1.0 become W3C Recs
- 2008: PLS 1.0 becomes a W3C Rec
- 2009: EMMA 1.0 becomes a W3C Rec

Preparing to announce VoiceXML 1.0
Friday Feb. 25th, 2000 – Lucent, Naperville, Illinois

Left to right: Gerald Karam (AT&T), Linda Boyer (IBM), Ken Rehor (Lucent),
Bruce Lucas (IBM), Pete Danielsen (Lucent), Jim Ferrans (Motorola),
Dave Ladd (Motorola).


Speech Interface Framework in 2000
(by Jim Larson)

[Figure: the W3C Speech Interface Framework. The user speaks over a telephone
system; ASR (driven by the Speech Recognition Grammar Spec. SRGS, Semantic
Interpretation for Speech Recognition SISR, and an N-gram Grammar ML) and a
DTMF tone recognizer feed Language Understanding and Context Interpretation
into the Dialog Manager, which connects to the World Wide Web and to the Call
Control XML (CCXML) component. Output flows through Media Planning and
Language Generation to TTS (Speech Synthesis Markup Language, SSML) and a
pre-recorded audio player; the Pronunciation Lexicon Specification (PLS)
serves both ASR and TTS. Dialogs are authored in VoiceXML 2.0/2.1; results
are represented in EMMA.]


Speech Interface Framework - Today
(by Jim Larson)

[Figure: the same Speech Interface Framework diagram, showing the status of
each specification today, with EMMA 1.0 now a Recommendation.]


Speech Interface Framework - End of 2009
(by Jim Larson)

[Figure: the same Speech Interface Framework diagram, showing the expected
status of each specification by the end of 2009.]


W3C Process

[Figure: the W3C Recommendation track process.]


Architectural Changes

Traditional (proprietary) architecture:

  User <-> [ASR/DTMF, TTS/Audio] Proprietary platform <-> Proprietary
  application, built with a proprietary Service Creation Environment (SCE)

VoiceXML architecture:

  User <-> [ASR/DTMF, TTS/Audio] VoiceXML platform / VoiceXML Browser
  <-> HTTP <-> Web application, exchanging .vxml pages, grammars
  (.grxml/.gram), lexicons (.pls), and prompts (.ssml, .wav/.mp3)


The VoiceXML Impact

VoiceXML changed the landscape of IVRs and speech application creation: from
proprietary to standards-based speech applications.

Before:
- Proprietary platforms (HW & SW)
- Proprietary applications (built with a proprietary SCE)
- Mainly DTMF and pre-recorded prompts
- First attempts to add speech into IVR

After:
- Standard VoiceXML platforms
- Standards for speech technologies
- Standard tools for VoiceXML applications
- Integration of DTMF and ASR
- Still a predominance of DTMF, but more and more speech applications


Overview

- A Bit of History ✓
- W3C Speech Interface Framework Today
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions
- The Near Future
Standards for ASR and DTMF
SRGS 1.0, SISR 1.0



W3C Standards for Speech/DTMF Grammars

SYNTAX – SRGS: defines constraints on the sentences admissible for a specific
recognition turn. Two notations (ABNF and XML), two modes (voice and dtmf).
http://www.w3.org/TR/speech-grammar/

SEMANTICS – SISR: describes how to produce results after an utterance is
recognized. Two flavors: literal and script.
http://www.w3.org/TR/semantic-interpretation/


SRGS/SISR Grammars for "Torino"

SISR literal – SRGS XML:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals">
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>10100</tag>
  </rule>
</grammar>

SISR literal – SRGS ABNF:

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0-literals>;
public $main = Torino {10100};

SISR script – SRGS XML:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0">
  <tag>var unused=7;</tag>
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>out="10100";</tag>
  </rule>
</grammar>

SISR script – SRGS ABNF:

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0>;
{var unused=7;};
public $main = Torino {out="10100";};


SRGS/SISR Standards – Pros

- Powerful syntax (CFG) and very powerful semantics (ECMAScript)
- DTMF and voice input are transparent to the application
- Wide and consistent adoption among technology vendors
- The two syntaxes, XML and ABNF, are great:
  - Developers can choose (XML validation vs. compact format)
  - Transformations are possible:
    - XML to ABNF (easy, a simple XSLT)
    - ABNF to XML (requires an ABNF parser)
- Open-source tools might be created to:
  - Validate grammar syntax
  - Transform grammars
  - Debug grammars on written input
  - Run coverage tests: explode covered sentences, GenSem, SemTester, etc.


SRGS/SISR Standards – Small Issues

- Semantics declaration: the tag-format attribute
  - Value "semantics/1.0"? Mandates SISR script semantics inside semantic tags
  - Value "semantics/1.0-literals"? Mandates SISR literal semantics inside
    semantic tags
  - Missing? Unclear! Risk of interoperability troubles
- SISR script semantics
  - Clumsy default assignment: returns the last referenced rule only
  - The developer must properly pop results up (see the sketch below)
  - Be careful when redefining "out": assigning a scalar value may result in
    errors
- SISR literal semantics
  - Only useful for very simple word-list rules
  - No support for encapsulating rules
  - Use SISR literal grammars as external references ONLY!
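A minimal sketch of the "pop results up" point (rule names and values are
hypothetical): with script semantics, an outer rule must explicitly copy a
subrule's result into its own "out", otherwise only the last referenced
rule's value survives.

<rule id="trip" scope="public">
  from <ruleref uri="#city"/>
  <!-- copy the subrule result up explicitly -->
  <tag>out.origin = rules.city;</tag>
</rule>

<rule id="city">
  <one-of>
    <item>Torino <tag>out = "TRN";</tag></item>
    <item>Boston <tag>out = "BOS";</tag></item>
  </one-of>
</rule>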


SRGS/SISR – Encapsulated Grammars

[Figure: a script-semantics grammar (Gr1.grxml) referencing external grammars
of both kinds – literal (Gr2.gram, Gr41.grxml) and script (Gr3.grxml,
Gr42.gram) – illustrating that literal grammars are safe only as external
references.]


SRGS/SISR Standards – Rich XML Results

Section 7 of the SISR 1.0 specification,
http://www.w3.org/TR/semantic-interpretation/#SI7, defines the serialization
rules from SISR ECMAScript results into XML. Edge cases include arrays, the
special variables "_attributes" and "_value", and the creation of namespaces
and prefixes.

ECMAScript result:

{
  drink: {
    _nsdecl: {
      _prefix: "n1",
      _name: "http://www.example.com/n1"
    },
    _nsprefix: "n1",
    liquid: {
      _nsdecl: {
        _prefix: "n2",
        _name: "http://www.example.com/n2"
      },
      _attributes: {
        color: {
          _nsprefix: "n2",
          _value: "black"
        }
      },
      _value: "coke"
    },
    size: "medium"
  }
}

XML serialization:

<n1:drink xmlns:n1="http://www.example.com/n1">
  <liquid n2:color="black"
          xmlns:n2="http://www.example.com/n2">coke</liquid>
  <size>medium</size>
</n1:drink>


SRGS/SISR Standards – Next Steps

- Adoption of the PLS 1.0 lexicon
  - Clear entry point into PLS lexicons: the <token> element
  - Missing: a role attribute on <token> to allow homograph disambiguation
- Next extensions via errata
  - XML 1.1 support and Implementation Report
  - Updated normative references
- No major extensions are needed!
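For comparison, PLS 1.0 already has a role attribute on <lexeme>; a matching
role on the SRGS <token> would let a grammar pick the intended entry. A
hedged sketch (the POS namespace and role values are hypothetical):

<lexeme xmlns:mypos="http://www.example.com/pos" role="mypos:noun-music">
  <grapheme>bass</grapheme>
  <phoneme>beɪs</phoneme>
</lexeme>
<lexeme xmlns:mypos="http://www.example.com/pos" role="mypos:noun-fish">
  <grapheme>bass</grapheme>
  <phoneme>bæs</phoneme>
</lexeme>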


Speech Synthesis
SSML 1.0/1.1



TTS – Functional Architecture and Markup/Non-Markup Support

Pipeline: Structure Analysis → Text Normalization → Text-to-Phoneme
Conversion → Prosody Analysis → Waveform Production

- Structure analysis. Markup: <p>, <s>. Non-markup: infer the structure by
  automatic text analysis.
- Text normalization. Markup: <say-as> for dates, times, phone numbers and
  numbers; <sub> for acronyms and transliterations. Non-markup: automatically
  identify and convert such constructs.
- Text-to-phoneme conversion. Markup: <phoneme>, <lexicon>. Non-markup: look
  up in a pronunciation dictionary.
- Prosody analysis. Markup: <emphasis>, <break>, <prosody>. Non-markup:
  automatically generate prosody through analysis of document structure and
  sentence syntax.
- Waveform production. Markup: <voice>, <audio>.

http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language Description (I)

Document structure: the <speak> root element carries the version attribute,
the SSML namespace, and the language (xml:lang).

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
</speak>

Processing and pronunciation:
- <p> and <s> (paragraph and sentence): give a structure to the text
- <say-as>: indicates the type of text construct contained within the
  element, e.g. dates, numbers, etc.
- <phoneme>: provides a phonetic pronunciation for the contained text, in IPA
- <sub>: provides substitutions, e.g. for expanding acronyms into a sequence
  of words

http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language Description (II)

Style:
- <voice> element

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  The moon is rising on the beach, when John says,
  looking Mary in the eyes:
  <voice name="simon">I love you!</voice>
  but she suddenly replies:
  <voice name="susan">Please, be serious!</voice>
</speak>

  The voice selection attributes are: name, xml:lang, gender, age, and
  variant.

- <emphasis> element
  Requests that the contained text be spoken with emphasis. The level
  attribute can be set to strong, moderate, reduced, or none.
- <break> element
  Controls the pausing between words. The time attribute takes time
  expressions ("5s", "20ms"); the strength attribute takes the values none,
  x-weak, weak, medium (the default), strong, or x-strong.

http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language Description (III)

Prosody:
- <prosody> element
  Permits control of the pitch, speaking rate and volume of the speech
  output. Its attributes are:
  - volume: the volume for the contained text
  - rate: the speaking rate in words per minute for the contained text
  - duration: a value in seconds or milliseconds for the desired time to take
    to read the element contents
  - pitch: the baseline pitch for the contained text
  - range: the pitch range (variability) for the contained text, in Hertz
  - contour: sets the actual pitch contour for the contained text

Other elements:
- <audio>: plays an audio file
- <mark>: places a marker into the text/tag sequence
- <desc>: provides a description of a non-speech audio source in <audio>

http://www.w3.org/TR/speech-synthesis/
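A minimal sketch combining these elements (the attribute values and audio URI
are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your flight leaves at
  <emphasis level="strong">seven thirty</emphasis>
  <break time="300ms"/>
  <prosody rate="slow" pitch="+10%">from gate twelve.</prosody>
  <audio src="http://www.example.com/chime.wav">
    <desc>confirmation chime</desc>
    beep
  </audio>
</speak>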
Towards SSML 1.1 – Motivations

Internationalization needs:
- Three workshops: Beijing (Nov '05), Crete (May '06), Hyderabad (Jan '07)
- Results:
  - No major needs for Eastern and Western European languages
  - Many issues for Far East languages (Mandarin, Japanese, Korean)
  - Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and
    many Indian languages:
    - Mark input with or without vowels
    - Mark the transliteration scheme used for input

Extensions required by the Voice Browser WG:
- More powerful error handling, selection of fall-back strategies
- Trimming attributes
- The volume attribute now adopts a logarithmic scale (before it was linear)

Alignment with the PLS 1.0 specification for user lexicons.

http://www.w3.org/TR/speech-synthesis11/
SSML 1.1 – Language Changes

- <w> element (word/token boundaries)
- Lexicon extensions: the <lookup> element scopes the application of a
  referenced lexicon to the contained content
- Phonetic Alphabet Registry creation and adoption:
  - "ipa" for the International Phonetic Alphabet
  - A registration policy for other phonetic alphabets, similar to LTRU for
    language tags
  - Candidates:
    - Pinyin for Mandarin Chinese
    - JEITA for Japanese
    - X-SAMPA, an ASCII transliteration of IPA codes

http://www.w3.org/TR/speech-synthesis11/
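A sketch of the new lexicon scoping, assuming an SSML 1.1 processor (the
lexicon URI is illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/names.pls" xml:id="names"/>
  <lookup ref="names">Benigni directed the movie.</lookup>
  Outside the lookup, default pronunciations apply.
</speak>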
Pronunciation Lexicon
PLS 1.0



Pronunciation Lexicons

- Pronunciation lexicon: a mapping between words (or short phrases), their
  written representations, and their pronunciations, suitable for use by an
  ASR engine or a TTS engine.
- Pronunciation lexicons are not only useful for voice browsers. They have
  also proven to be effective mechanisms to support accessibility for the
  differently abled, as well as greater usability for all users. They are
  used to good effect in screen readers and in user agents supporting
  multimodal interfaces.
- The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is designed
  to enable the interoperable specification of pronunciation lexicons.

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Language Overview

- A PLS document is a container (<lexicon>) of several lexical entries
  (<lexeme>).
- Each lexical entry contains:
  - One or more spellings (<grapheme>)
  - One or more pronunciations (<phoneme>) or substitutions (<alias>)
- Each PLS document is related to a single, unique language (xml:lang).
- SSML 1.0 and SRGS 1.0 documents can reference one or more PLS documents.
- The current version doesn't include morphological, syntactic or semantic
  information associated with pronunciations.

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – An Example

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
      http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
    alphabet="ipa" xml:lang="en-US">

  <lexeme>
    <grapheme>Sepulveda</grapheme>
    <phoneme>səˈpʌlvɪdə</phoneme>
  </lexeme>

  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>

</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for TTS

SSML 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" … xml:lang="en-US">
  <lexicon uri="http://www.example.com/SSMLexample.pls"/>
  The title of the movie is: "La vita è bella" (Life is beautiful),
  which is directed by Benigni.
</speak>

PLS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>La vita è bella</grapheme>
    <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Benigni</grapheme>
    <phoneme>bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for ASR

SRGS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
  <lexicon uri="http://www.example.com/SRGSexample.pls"/>
  <rule id="movies" scope="public">
    <one-of>
      <item>Terminator 2: Judgment Day</item>
      <item>Pluto's Judgement Day</item>
    </one-of>
  </rule>
</grammar>

PLS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>judgment</grapheme>
    <grapheme>judgement</grapheme>
    <phoneme>ˈdʒʌdʒ.mənt</phoneme>
  </lexeme>
</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
Examples of Use

- Multiple pronunciations for the same orthography
- Multiple orthographies
- Homophones
- Homographs
- Acronyms, abbreviations, etc.

Detailed descriptions can be found in the W3C specification, on Wikipedia,
and in Paolo Baggia's talks at SpeechTEK 2008 & Voice Search 2009.


PLS 1.0 – Open Issues

- No wide support for IPA in speech engines
  - Changes are slowly under way
  - The Phonetic Alphabet Registry will open the door to other alphabets in a
    controlled and interoperable way
- Integration in ASR/TTS
  - SSML 1.1 will interoperate with PLS 1.0
  - SRGS 1.0 is still missing support for the role attribute for PLS 1.0
- No matching algorithm inside PLS, because it is mainly a data format

http://www.w3.org/TR/pronunciation-lexicon/
Pronunciation Alphabets
IPA, SAMPA



International Phonetic Alphabet

Pronunciation is represented by a phonetic alphabet:
- Standard phonetic alphabets
  - International Phonetic Alphabet (IPA)
- Well-known phonetic alphabets
  - SAMPA – ASCII based (simple to write)
  - Pinyin (Mandarin Chinese), JEITA (Japanese), etc.
- Proprietary phonetic alphabets

International Phonetic Alphabet (IPA):
- Created by the International Phonetic Association (active since 1886), a
  collaborative effort by all the major phoneticians around the world
- A universally agreed system of notation for the sounds of languages
- Covers all languages
- Requires Unicode to write it
- Normatively referenced by PLS


IPA – Chart

- The IPA was founded in 1886 and is the major international association of
  phoneticians.
- The IPA alphabet provides symbols making possible the phonemic
  transcription of all known languages.
- IPA characters can be encoded in Unicode by supplementing ASCII with
  characters from other ranges, particularly:
  - IPA Extensions (0250–02AF)
  - Latin Extended-A (0100–017F)
- See the detailed charts: http://www.unicode.org/charts


Phonetic Alphabets – Issues

- The real problem is how to write pronunciations reliably, unless you are a
  trained phonetician.
- There are issues with fonts, authoring and browsers, but Unicode fonts
  today support the IPA extensions, see:
  http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm
- There are very few tools to help write pronunciations and to let you listen
  to what you have written.
- Goal: make pronunciations available in IPA or other general phonetic
  alphabets.


Voice Dialog languages:
VoiceXML 2.0
VoiceXML 2.1



VoiceXML 2.0 – Features, Elements

- Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>
- Input
  - Speech recognition: <grammar>
  - Recording: <record>
  - Keypad: <grammar mode="dtmf">
- Output
  - Audio files: <audio>
  - Text-To-Speech: <prompt>
- Variables (ECMA-262): <var>, <assign>, <script>, with scoping rules
- Events: <nomatch>, <noinput>, <help>, <catch>, <throw>
- Transition and submission: <goto>, <submit>
- Telephony
  - Connection control: <transfer>, <disconnect>
  - Telephony information
- Platform specifics: <object>
- Performance: fetch and caching properties

http://www.w3.org/TR/voicexml20/
VoiceXML 2.0 – Execution Model

- Execution is synchronous
  - Only the disconnect event is handled (somewhat) asynchronously
- Execution is always in a single dialog: <form> or <menu>
  - The Form Interpretation Algorithm (FIA) drives <field> selection
- Prompts are queued
  - Played only when a waiting state is encountered
  - Played before a fetchaudio is started
- Processing is always in one of two states:
  - Waiting for input in an input item: <field>, <record>, <transfer>, etc.
  - Transitioning between input items in response to an input
- Event-driven:
  - <nomatch>, <noinput>: user input event handling
  - <catch>, <throw>: generalized event mechanism
  - connection.*: call event handling
  - error.*: error event handling

http://www.w3.org/TR/voicexml20/
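A minimal sketch of a form driven by the FIA (the grammar URI and submit
target are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="booking">
    <field name="dest">
      <prompt>Which city do you want to fly to?</prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
      <noinput>Please say a city name.</noinput>
      <nomatch>Sorry, I did not understand.</nomatch>
      <filled>
        <!-- the FIA leaves the form via this transition -->
        <submit next="http://www.example.com/book" namelist="dest"/>
      </filled>
    </field>
  </form>
</vxml>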
VoiceXML 2.1 – Extended Features

- Dynamically reference grammars and scripts:
  <grammar expr="…">, <script expr="…">
- Record the user's utterance during form filling:
  - recordutterance property
  - New shadow variables: recording, recordingsize, recordingduration
- Detect barge-in during prompt playback (SSML <mark>):
  - markexpr attribute
  - New shadow variables: markname and marktime
- Fetch XML data without a transition: <data>
  - Uses a read-only subset of the DOM
- Dynamically concatenate prompts: <foreach>
  - Iterates through ECMAScript arrays and executes content
- Send data upon disconnect: <disconnect namelist="…">
- Additional transfer type: <transfer type="consultation">

http://www.w3.org/TR/voicexml21/
VoiceXML Applications

- Static VoiceXML applications
  - The VoiceXML page is always the same, and so is the user experience
  - No personalization or customization
- Dynamic VoiceXML applications
  - The user experience is customized:
    - After authentication (PIN)
    - Using the caller-id or SIP-id
  - Data driven
  - Dynamic pages generated at runtime, e.g. by JSP, ASP, etc.

http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
A Drawback of VoiceXML 2.0

- A drawback of VoiceXML is that the transition from one VoiceXML page to
  another is a costly activity:
  - Fetch the new page, if not cached
  - Parse the page
  - Initialize the context, possibly loading and initializing a new
    application root document
  - Load or pre-compile scripts
- Transitions are the only way to return data to the web application (if the
  VoiceXML is dynamic)
- Pages must be created to include dynamic data
- VoiceXML 2.1 addresses part of this drawback by feeding dynamic data to a
  running VoiceXML page

http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
Advantages of VoiceXML 2.1 – AJAX

- Two of the eight new features in VoiceXML 2.1 help create more dynamic
  VoiceXML applications:
  - the <data> element
  - the <foreach> element
- A static VoiceXML document can fetch user-specific data at runtime, without
  changing the VoiceXML document:
  - <data> allows retrieval of arbitrary XML data without VoiceXML document
    transitions; the returned XML data is accessible through a subset of DOM
    primitives
  - <foreach> extends prompts to allow iteration over a dynamic array of
    information, creating a dynamic prompt
- This is similar to AJAX programming for HTML services: it decouples the
  presentation layer (VoiceXML) from the business logic (accessed via <data>)

http://www.w3.org/TR/voicexml21/
VoiceXML 2.1 – <data> Element

- Attributes:
  - name: the variable to be filled with the DOM of the retrieved data
  - src or srcexpr: the URI of the location of the XML data
  - namelist: the list of variables to be submitted
  - method: either 'get' or 'post'
  - enctype: media encoding
  - fetch and caching attributes
- Like <var>, it may appear in executable content and as a child of <form>
  and <vxml>
- The value of name must be a declared variable
- The platform fills the variable with the DOM of the fetched XML data
- <data> is synchronous (the service stops to get the data)

http://www.w3.org/TR/voicexml21/
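A hedged sketch of <data> in use (the URI and the shape of the returned XML,
e.g. <balance>42</balance>, are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="account">
    <var name="balance"/>
    <block>
      <!-- fill the declared variable with the DOM of the fetched XML -->
      <data name="balance" src="http://www.example.com/balance.xml"/>
      <prompt>
        Your balance is
        <value expr="balance.documentElement.firstChild.data"/> euros.
      </prompt>
    </block>
  </form>
</vxml>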
VoiceXML 2.1 – <foreach> Element

- Attributes:
  - array: an ECMAScript expression that must evaluate to an ECMAScript array
  - item: the variable that stores the element currently being processed
- <foreach> allows the application to iterate over an ECMAScript array and to
  execute the content
- <foreach> may appear:
  - In executable content (all executable content elements may appear as
    content of <foreach>)
  - In <prompt> (with restrictions on the content)
- <foreach> allows sophisticated concatenation of prompts

http://www.w3.org/TR/voicexml21/
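A sketch of prompt concatenation with <foreach>, assuming an ECMAScript array
built earlier in the page (variable names are illustrative):

<script>
  var flights = [ {city: "Boston", time: "7:30 AM"},
                  {city: "Denver", time: "9:05 AM"} ];
</script>
<prompt>
  You have the following flights:
  <foreach item="f" array="flights">
    to <value expr="f.city"/> at <value expr="f.time"/>.
    <break time="200ms"/>
  </foreach>
</prompt>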
VoiceXML – Final Remarks

- The landscape for speech application development has changed:
  - Virtually all IVRs today support VoiceXML
  - New options related to VoiceXML:
    - SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie)
    - Large-scale hosting of speech applications (TellMe, Voxeo)
    - Development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.)
  - Further changes may come from CCXML adoption
- … but:
  - Mainly system-driven applications are actually deployed
  - New challenges to incorporate more powerful dialog strategies, such as
    mixed initiative, are under discussion

http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
VoiceXML Resources

- Voice Browser Working Group (spec, FAQ, implementations, resources):
  http://www.w3.org/Voice/
- VoiceXML Forum site (resources, education, interest groups):
  http://www.voicexml.org/
  - VoiceXML Forum Review: http://www.voicexmlreview.org/
    Interesting articles related to VoiceXML and more; example code in the
    sections "First Words" and "Speak & Listen"
- Ken Rehor's World of VoiceXML: http://www.kenrehor.com/voicexml
- Online documentation related to VoiceXML platforms:
  Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie
- Many books on VoiceXML:
  - Jim Larson, "VoiceXML: Introduction to Developing Speech Applications",
    Prentice-Hall, 2002
  - A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002


Call Control:
CCXML 1.0



CCXML 1.0 – Highlights

- Asynchronous event processing
- Acceptance or refusal of an incoming call
- Management of different types of call transfer
- Outbound call activation (interaction with an external entity)
- Use of ECMAScript, adding scripting capabilities to call control
  applications
- VoiceXML modularization
- Conferencing management


CCXML 1.0 – Elements Relationship

[Figure: relationships among the CCXML elements.]


CCXML 1.0 – Incoming Call

The CCXML interpreter catches and processes events (e.g. connection.alerting)
delivered through the event$ object with their payload (name, connectionid,
eventid, …):

<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0">
  […]
  <transition event="connection.alerting">
    […]
  </transition>
  <transition event="connection.disconnected">
    […]
  </transition>
  ....
</ccxml>

http://www.w3.org/TR/ccxml


CCXML 1.0 – connection.alerting Event

- Basic telephony information is retrieved on the alerting event and is
  available in the CCXML document: local URI, remote URI, protocol used,
  redirection info, etc.
- Based on this information, CCXML can accept or refuse the incoming call,
  even before contacting the dialog server.
- Any error occurring during the phone call can be managed by the CCXML
  service (connection.failed, error.connection events).

[Sequence: the Call Control Adapter raises connection.alerting; the CCXML
Interpreter analyzes the event$ content and answers with <accept/> or
<reject/>.]

http://www.w3.org/TR/ccxml


CCXML 1.0 – How to Activate a New Dialog

CCXML actions:
- Receive the alerting event from the Call Control Adapter
- Ask the dialog server to prepare a new dialog
- Wait for the preparation
- If the dialog has been successfully prepared, accept the call
- Ask the dialog server to start the prepared dialog

[Sequence between Call Control Adapter, CCXML Interpreter and VoiceXML
Interpreter: alerting → prepare a new dialog → dialog prepared → call
accepted → connected → start the prepared dialog → dialog started.]


Call Transfer

- CCXML supports call transfer in different modalities: "bridge", "blind",
  "consultation".
- Based on the features of each modality, the CCXML language allows the
  expected interaction with the Call Control Adapter to correctly perform the
  transfer.
- During the different phases of a transfer, CCXML can receive any
  asynchronous event and correctly manage it, interrupting the call if
  requested.

[Sequence: the CCXML Interpreter performs a transfer by exchanging commands
and answers with the Call Control Adapter until the transfer completes.]


External Events

- The CCXML Interpreter Context can receive events from any external entity
  able to use the HTTP protocol.
- Events generated this way must be sent to the CCXML interpreter by an HTTP
  POST command.
- The event is then injected, and:
  - It can be addressed to a new session, whose creation must be requested
  - It can be addressed to an existing session, specifying its ID in the
    request

[Sequence: the external entity sends a basic HTTP event; the CCXML
Interpreter manages it and returns the event management result.]

http://www.w3.org/TR/ccxml


External Event on a New Session: the Outbound Call

- A request arrives at Call Control from an external entity.
- The CCXML service associated with the received event is started, and a set
  of operations between the Call Control Adapter, Call Control and the Dialog
  Server is activated: the outbound call is placed.

[Sequence, after the outbound call request: create a call → connection
progressing → prepare a dialog → prepared → connection connected → start the
prepared dialog.]


External Event on a Session: Dialog Termination Request

- An external entity performs an HTTP POST request towards the CCXML
  Interpreter Context, specifying a sessionid and requesting the termination
  of a particular dialog.
- The CCXML interpreter checks the session id; if it is valid, the
  interpreter injects the received event into the session.
- The CCXML service has a transition on that event and performs the dialog
  termination on a particular dialog identifier.

[Sequence: dialogterminate (dialogid) → dialog.exit; the follow-up, e.g.
disconnect(connId) or a new dialogprepare, depends on the dialog.exit event
management.]


Loading Different CCXML Documents: <fetch> and <goto>

- The <fetch> and <goto> elements are used, respectively, to asynchronously
  fetch the content identified by the attributes of <fetch>, and to go to a
  fetched document once it has been successfully loaded.
- Benefits: modularization, simpler source documents, more readability.

<fetch
   next="'http://../Fetch/doc1.ccxml'"
   type="'application/ccxml+xml'"
   fetchid="result"/>

- Fetching the document raises fetch.done or error.fetch; the first event
  that occurs in a new document is ccxml.loaded. The service can then <goto>
  the new document, or continue to work in the current one.

http://www.w3.org/TR/ccxml


Simple CCXML Document

<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="currentState"/>
  <var name="myDialogId"/>
  <var name="myConnId"/>
  <eventprocessor statevariable="currentState">
    <transition event="connection.alerting">
      <assign name="myConnId" expr="event$.connectionid"/>
      <accept connectionid="event$.connectionid"/>
    </transition>
    <transition event="connection.connected">
      <dialogstart src="'http://www.example.com/flight.vxml'"
                   connectionid="myConnId" dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.started">
      <log expr="'VoiceXML appl is running now'"/>
    </transition>
    <transition event="connection.disconnected">
      <dialogterminate dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.exit">
      <disconnect connectionid="myConnId"/>
    </transition>
    <transition event="*">
      <log expr="'Closing, unexpected: ' + event$.name"/>
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>


CCXML 1.0 – Next Steps

- The CCXML specification is a Last Call Working Draft; all feature requests
  and clarifications have been addressed.
- An Implementation Report test suite is under development.
- The spec is very close to being published as a W3C Candidate
  Recommendation.
- Companies, inside or outside the Working Group, will be invited to submit
  implementation reports on their CCXML platforms.
- After that, the CCXML 1.0 specification can become a Proposed
  Recommendation and then a W3C Recommendation.

http://www.w3.org/TR/ccxml


Speech Interface Framework
Tour Complete!



Speech Interface Framework - End of 2009
(by Jim Larson)

[Figure: the same Speech Interface Framework diagram, recapped, with each
specification at its expected status by the end of 2009.]


Architectural Changes

VoiceXML architecture:

  User <-> [ASR/DTMF, TTS/Audio] VoiceXML platform / VoiceXML Browser
  <-> HTTP <-> Web application, exchanging .vxml pages, grammars
  (.grxml/.gram), lexicons (.pls), and prompts (.ssml, .wav/.mp3)


VoxNauta – Internal Architecture

[Figure: the VoxNauta internal architecture.]


Loquendo MRCP Server/LSS 7.0 Architecture

[Figure: behind a load balancer, RTSP (MRCPv1) and SIP/SDP (MRCPv2)
front-ends feed the RTSP, SIP and SDP parsers of the MRCP v1/v2 server, which
runs on Win32/Linux alongside management components (SNMP, graphic management
console, config files), a logger, and an audio provider API. Results are
returned as NLSML or EMMA. A common TTS & ASR interface connects to the LTTS,
LASR and LASR-SV engines; media flows over RTP.]


IETF MRCP Protocols

- The Media Resource Control Protocols (MRCP) are IETF standards:
  - MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on
    RTSP/RTP
  - MRCPv2 is an Internet Draft,
    http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on
    SIP/RTP, offering new audio recording and speaker verification
    functionalities
- An optimized client-server solution for the large-scale deployment of
  speech technologies in the telephony field, such as call centers, CRM, news
  and email reading, self-service applications, etc.
- Provides a standard interface to speech technologies for all IVR platforms.

For more information read: Dave Burke, "Speech Processing for IP Networks:
Media Resource Control Protocol (MRCP)", Wiley.


VoiceXML in a Call Center

[Figure: calls from the fixed/mobile network reach the PBX (through an
optional voice gateway for non-SIP PBXs); the VOXNAUTA IVR and the ACD route
callers to the operators, backed by Web, CTI and data servers.]


VoiceXML in the IMS Architecture

[Figure: voice from the fixed/mobile network enters through a voice gateway
(TDM protocols on one side, SIP and RTP on the other) to the VOXNAUTA MRF;
over the IP network, the MRF fetches VoiceXML from the application server via
HTTPS.]
Overview

- A Bit of History ✓
- W3C Speech Interface Framework Today ✓
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions
- The Near Future
Modes, Modalities and Technologies

- Speech
- Audio
- Stylus
- Touch
- Accelerometer
- Keyboard/keypad
- Mouse/touchpad
- Camera
- Geolocation
- Handwriting recognition
- Speaker verification
- Signature verification
- Fingerprint identification
- …


Complement and Supplement

Speech:
- Transient
- Linear
- Hands- and eyes-free
- Suffers from noise

Visual:
- Persistent
- Spatial
- Requires the eyes
- Suffers from light conditions

- Multimodality enables choosing among different modalities, or mixing them
- Adaptable to different social and environmental conditions, or to user
  preference


GUI + VUI = MUI (or MMUI)


MMI has an Intrinsic Complexity

[Figure (Deborah Dahl, Voice Search 2009): an Interaction Manager surrounded
by many inputs – speech, text, mouse, handwriting, accelerometer, photograph,
drawing, video, audio recording, fingerprint, face identification,
geolocation, speaker verification, vital signs – grouped into sensor,
identification, user intent and recording functions.]
MMI can Include Many Different Technologies

[Figure (Deborah Dahl, Voice Search 2009): an Interaction Manager connected
to speech recognition, touchscreen, accelerometer, geolocation, keypad,
fingerprint recognition, and handwriting recognition.]


Uniform Representation for MMI

- Getting everything to work together is complicated. One simplification is
  to represent the same information from different modalities in the same
  format.
- We need a common language for representing the same information coming
  from different modalities:
- EMMA (Extensible MultiModal Annotation) 1.0 – a uniform representation for
  multimodal information


[Figure (Deborah Dahl, Voice Search 2009): the same set of modality
components – speech recognition, touchscreen, accelerometer, geolocation,
keypad, fingerprint recognition, handwriting recognition – each now
delivering its results to the Interaction Manager as EMMA documents.]


EMMA Structural Elements

EMMA elements provide containers for application semantics and for multimodal
annotation: emma:emma, emma:interpretation, emma:one-of, emma:group,
emma:sequence, emma:lattice.

<emma:emma …>
  <emma:one-of>
    <emma:interpretation>
      …
    </emma:interpretation>
    <emma:interpretation>
      …
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

http://www.w3.org/TR/emma/
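A sketch of an N-best list carried by <emma:one-of> (the values and
confidences are illustrative):

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.20">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>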
EMMA Annotations

Characteristics and processing of the input, e.g.:

- emma:tokens – tokens of the input
- emma:process – reference to the processing
- emma:no-input – lack of input
- emma:uninterpreted – uninterpretable input
- emma:lang – human language of the input
- emma:signal – reference to the signal
- emma:media-type – media type
- emma:confidence – confidence scores
- emma:source – annotation of the input source
- emma:start, emma:end – timestamps (absolute/relative)
- emma:medium, emma:mode, emma:function – medium, mode, and function of the
  input
- emma:hook – hook

http://www.w3.org/TR/emma/
EMMA 1.0 – Example Travel Application

INPUT: "I want to go from Boston to Denver on March 11"

http://www.w3.org/TR/emma/ (example: Deborah Dahl, Voice Search 2009)


EMMA 1.0 – Same Meaning from Different Modalities

Speech:

<emma:interpretation medium="acoustic" mode="voice" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>

Mouse:

<emma:interpretation medium="tactile" mode="gui" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>

http://www.w3.org/TR/emma/ (example: Deborah Dahl, Voice Search 2009)


EMMA 1.0 – Handwriting Input

<emma:interpretation medium="tactile" mode="ink" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>

http://www.w3.org/TR/emma/ (example: Deborah Dahl, Voice Search 2009)


EMMA 1.0 – Biometrics Input

Photograph:

<emma:emma version="1.0">
  <emma:interpretation id="int1"
      emma:confidence=".75"
      emma:medium="visual"
      emma:mode="photograph"
      emma:verbal="false"
      emma:function="identification">
    <person>12345</person>
    <name>Mary Smith</name>
  </emma:interpretation>
</emma:emma>

Voice:

<emma:emma version="1.0">
  <emma:interpretation id="int1"
      emma:confidence=".80"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="false"
      emma:function="identification">
    <person>12345</person>
    <name>Mary Smith</name>
  </emma:interpretation>
</emma:emma>

http://www.w3.org/TR/emma/ (example: Deborah Dahl, Voice Search 2009)


EMMA 1.0 – Representing Lattices

Speech recognizers, handwriting recognizers and other input processing
components may provide lattice output: a graph encoding a range of possible
recognition results or interpretations.

[Lattice figure: nodes 1–8, covering "flights to {boston|austin} from
{portland|oakland} {today please | tomorrow}".]

(From Michael Johnston, AT&T Research)
http://www.w3.org/TR/emma/
EMMA 1.0 – Representing Lattices

Lattices can be represented using the EMMA elements
<emma:lattice emma:initial="?" emma:final="?"> and
<emma:arc emma:from="?" emma:to="?">:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation>
    <emma:lattice emma:initial="1" emma:final="8">
      <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
      <emma:arc emma:from="2" emma:to="3">to</emma:arc>
      <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
      <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
      <emma:arc emma:from="4" emma:to="5">from</emma:arc>
      <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
      <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
      <emma:arc emma:from="6" emma:to="7">today</emma:arc>
      <emma:arc emma:from="7" emma:to="8">please</emma:arc>
      <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

(From Michael Johnston, AT&T Research)
http://www.w3.org/TR/emma/
EMMA in the Multimodal Framework

[Figure: EMMA's position in the W3C Multimodal Interaction Framework,
http://www.w3.org/TR/mmi-framework]


InkML 1.0 – Digital Ink

- Ink Markup Language (InkML), http://www.w3.org/TR/InkML
- A data format for representing digital ink (pen, stylus, etc.)
- Allows the input and processing of handwriting, gestures, sketches, music,
  etc.
<ink>
<trace>
10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126, 10 140,
13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147, 49 135,
58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191, 93 205
</trace>
<trace>
130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125,
152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200,
150 208, 163 210, 178 208, 192 201, 205 192, 214 180
</trace>
<trace>
227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134,
230 148, 234 162, 235 176, 238 190, 241 204
</trace>
<trace>
282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129,
291 143, 294 157, 294 171, 294 185, 296 199, 300 213
</trace>
<trace>
366 130, 359 143, 354 157, 349 171, 352 185, 359 197,
371 204, 385 205, 398 202, 408 191, 413 177, 413 163,
405 150, 392 143, 378 141, 365 150
</trace>
</ink>

http://www.w3.org/TR/InkML/
InkML 1.0 – Status and Advances

- Rich annotation for ink:
  - Traces, trace formats and trace collections
  - Contextual information
  - Canvases
  - Etc.
- The result of classifying InkML traces may be a semantic representation in
  EMMA 1.0
- Current status is Last Call Working Draft; next will be Candidate
  Recommendation, with the release of an Implementation Report test suite
- Raising interest from major industries

http://www.w3.org/TR/InkML/
MMI Architecture Specification

"Multimodal Architecture and Interfaces", W3C Working Draft,
http://www.w3.org/TR/mmi-arch/

- The Runtime Framework provides the basic infrastructure and controls
  communication among the constituents (Delivery Context Component,
  Interaction Manager, Data Component, Modality Components).
- The Interaction Manager (IM) coordinates the Modality Components (MCs)
  through life-cycle events and contains the shared data (context).
- Communication between the IM and the MCs is event-based, through the
  Modality Component API.

(Ingmar Kliche, SpeechTEK 2008)


MMI Arch – Laboratory Implementation

Implementation of the components using W3C markup languages: an SCXML-based
Runtime Framework / Interaction Manager, an HTML Modality Component for the
GUI, and a VoiceXML Modality Component for the VUI, connected through the
Modality Component API.

http://www.w3.org/TR/mmi-arch/ (Ingmar Kliche, SpeechTEK 2008)


MMI Arch – Laboratory Implementation

- SCXML-based Interaction Manager
- VoiceXML + HTML modality components

[Figure: a server runs the SCXML interpreter with an HTTP I/O processor. The
GUI modality component is an HTML browser client, speaking HTTP + XML (using
AJAX); the voice modality component is a CCXML/VoiceXML server with a
telephony interface to the phone client, speaking HTTP + XML (EMMA).]

http://www.w3.org/TR/mmi-arch/ (Ingmar Kliche, SpeechTEK 2008)


MMI Architecture – Open Issues

- Profiles
- Start-up, registration, delegation in a distributed environment
- Transport of events
- Extensibility of events

http://www.w3.org/TR/mmi-arch/


Emotion in Wikipedia

From the Wikipedia definition:

"An emotion is a mental and physiological state associated with a wide
variety of feelings, thoughts, and behaviours. It is a prime determinant of
the sense of subjective well-being and appears to play a central role in many
human activities. As a result of this generality, the subject has been
explored in many, if not all of the human sciences and art forms. There is
much controversy concerning how emotions are defined and classified."

General goal: make interaction between humans and machines more natural for
the humans. Machines should become able:
- to register human emotions (and related states)
- to convey emotions (and related states)
- to "understand" the emotional relevance of events


Emotional States are Numerous

[Figure (Scherer et al., Univ. Geneva): a circumplex of emotional states laid
out along two axes – active/passive arousal and positive/negative valence,
with a high/low power-control diagonal – ranging from AROUSED, TENSE, ANGRY,
AFRAID and EXCITED through HAPPY, PLEASED, CONTENT, SERENE and RELAXED down
to SAD, DEPRESSED, GLOOMY, BORED, SLEEPY and TIRED, with dozens of finer
states in between.]
HUMAINE Project

- HUMAINE project: a European Network of Excellence
  - Activity: 01/2004 - 12/2007
  - 33 partner institutions from many disciplines
- Today: the HUMAINE Association (since June 2007)
  - 125 members
  - Web site: http://emotion-research.net


Online Speaker Classification

Classification techniques:
- Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA):
  preprocessing steps to reduce the feature vector dimension
- K-nearest neighbor
- Gaussian Mixture Models: model the training data as Gaussian densities
- Artificial Neural Networks (ANN), e.g. MLP: interesting training algorithms
- Support Vector Machines (SVM): use "kernel functions" to separate
  non-linear decision boundaries
- Classification and Regression Trees (CART)
- Hidden Markov Models (HMMs): used to model temporal structure

(Felix Burkhardt, Colloquium Hochschule Zittau/Görlitz, 4.8.2008)


Expressive TTS – Two Approaches

1. Different speech-style databases, one for each expressive style: the input
   text plus expressive tags drives style selection over the databases
   (style 1, style 2, … style n) to produce the waveform.
   An effective solution, but feasible only for a very limited range of
   emotions.

2. Speech signal manipulation according to style-dependent prosodic models:
   the input text plus expressive tags drives prosodic-model selection and
   signal processing over a neutral-style database.
   A flexible solution, but it requires accurate models and effective signal
   processing capabilities.

(From Enrico Zovato, Loquendo)


Expressive TTS – Example Prosodic Patterns

Synthesis of two basic emotional styles through prosodic modification:
- different intonation contours
- different acoustic-unit durations

[Plot: F0 contours (0-500 Hz over 1.8 s) for POS ("happy") vs. NEG ("sad")
renderings, Male-UK and Female-UK voices.]

(From Enrico Zovato, Loquendo)
Emotions in ECAs

[Figure: expressive Embodied Conversational Agents; from Piero Cosi, CNR,
Padova.]


W3C Emotion Incubator

"The W3C Incubator Activity fosters rapid development, on a time scale of a
year or less, of new Web-related concepts. Target concepts include innovative
ideas for specifications, guidelines, and applications that are not (or not
yet) clear candidates as Web standards developed through the more thorough
process afforded by the W3C Recommendation Track."

W3C Emotion Incubator aims:
- First charter XG (2006-2007):
  "...to investigate the prospects of defining a general-purpose Emotion
  annotation and representation language...", "...which should be usable in a
  large variety of technological contexts where emotions need to be
  represented."
- Second charter XG (Nov. 2007 - Nov. 2008):
  - Prioritize the requirements
  - Release a first specification draft
  - Illustrate how to combine the Emotion Markup Language with existing
    markup languages


W3C Emotion Incubator – Members

Chairman: Marc Schröder, DFKI

W3C Members: DFKI, Loquendo, Deutsche Telekom, SRI International, NTUA,
Fraunhofer, Chinese Acad. of Science.
Invited Experts: Emotion AI, Univ. Paris 8, Univ. of the Basque Country,
Univ. College Cork, OFAI (Austria), IPCA (Portugal), Tech. Univ. Munich.

Web space: http://www.w3.org/2005/Incubator/emotion

Results:
- Use case description document
- Requirements document
- Final Report (20 Nov 2008): Elements of an EmotionML 1.0,
  http://www.w3.org/2005/Incubator/emotion/XGR-emotionml/
W3C Emotion Incubator – EmotionML 1.0

- Document structure:
  container element (<emotionml>), single emotion annotation (<emotion>)
- Representation of emotions:
  <category>, <dimensions>, <appraisals>, <action-tendency>, <intensity>
- Meta information:
  confidence attribute, <modality> element, <metadata> element
- Links and time:
  <link> element, <timing> element
- Scale values:
  value attribute, <traces> element


EmotionML 1.0 – Examples

Expression of emotions in SSML 1.1:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:emo="http://www.w3.org/2008/11/emotionml"
       xml:lang="en-US">
  <s>
    <emo:emotion>
      <emo:category set="everydayEmotions" name="doubt"/>
      <emo:intensity value="0.4"/>
    </emo:emotion>
    Do you need help?
  </s>
</speak>

Detection of emotions in EMMA 1.0:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
           xmlns="http://www.w3.org/2008/11/emotionml">
  <emma:interpretation start="12457990" end="12457995" mode="voice"
                       verbal="false">
    <emotion>
      <intensity value="0.1" confidence="0.8"/>
      <category set="everydayEmotions" name="boredom" confidence="0.1"/>
    </emotion>
  </emma:interpretation>
</emma:emma>


Overview

- A Bit of History ✓
- W3C Speech Interface Framework Today ✓
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today ✓
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions
- The Near Future
W3C VBWG/MMIWG – The Near Future

Specs for the next generation of voice browsing:
- SCXML 1.0
- VoiceXML 3.0


State Charts - SCXML

- State Chart XML (SCXML):
  http://www.w3.org/TR/2008/WD-scxml-20080516/
  - A powerful state-machine language
  - Based on David Harel's statecharts (see his book)
  - Adopted in UML
  - Standard under development by the W3C VBWG: http://www.w3.org/TR/scxml/
- States, transitions, events
  - A data model extends the basic finite state automaton
  - Conditions on transitions
- Nested states
  - Represent task decomposition
  - The machine can be in multiple dependent states at the same time
- Parallel states
  - Represent fork/join logic
- Wide interest:
  - VBWG, MMI WG, other W3C groups, universities, industries
  - Open-source implementations are already available
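A minimal sketch of an SCXML machine (the state names, event names and data
model variable are hypothetical):

<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
       initial="idle">
  <datamodel>
    <data id="lines" expr="0"/>
  </datamodel>
  <state id="idle">
    <!-- a condition on a transition, evaluated against the data model -->
    <transition event="call.incoming" cond="lines &lt; 2" target="talking"/>
  </state>
  <state id="talking">
    <transition event="call.hangup" target="idle"/>
  </state>
</scxml>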


SCXML 1.0 – Parallel State Charts

[Figure: a statechart with parallel regions.]


SCXML as MMI Interaction Manager

[Figure: an SCXML Interaction Manager coordinating a voice modality, a
gesture modality, and a visual modality.]


SCXML for VoiceXML 3.0

[Figure: the same arrangement, with VoiceXML 3.0 providing the voice modality
under SCXML control.]


SCXML 1.0 – Open Issues

- Data model: ECMAScript (ECMA-262) or other formats?
- Definition of profiles
- Other



Re-Thinking VoiceXML – VoiceXML 3.0

- Well-founded: from a syntactic description to a semantic model
- Extensible: SIV (Speaker Identification and Verification), EMMA support,
  rich media, VCR controls, etc.
- Profiled: a light profile (mobile?), a media profile (scalability), a
  VoiceXML 2.1 profile (interoperability), etc.
- Flexible: customization of the FIA (Form Interpretation Algorithm)


VoiceXML 3.0 – Separation of Concerns

- SCXML 1.0: application and interaction logic
- VoiceXML 3.0: voice interaction only, under the control of SCXML
- VoiceXML 3.0 has been published as a First Public Working Draft,
  http://www.w3.org/TR/2008/WD-voicexml30-20081219/
  - Send public comments!


THANK YOU

For clarifications or questions:
paolo.baggia@loquendo.com


THANK YOU

For more information:
- Keep an eye on www.loquendo.com
- Contact: paolo.baggia@loquendo.com
- Keep in touch with Loquendo news: subscribe to the Loquendo Newsletter
- Try our interactive TTS demo: insert your text, choose a language, and
  listen
- The latest news at a click: consult the Loquendo Newsletter online and keep
  up to date on events and initiatives
- For further information, fill in our Contacts Form

Loquendo S.p.A.                     Loquendo S.p.A.
745 Fifth Ave, 27th Floor           Via Olivetti, 6
New York, NY 10151, USA             10148 Torino, Italy
Tel. +1 212.310.9075                Tel. +39 011 291 3111
Fax  +1 212.310.9001                Fax  +39 011 291 3199
www.loquendo.com                    www.loquendo.com