
Voice Browser and Multimodal Interaction in 2009

Paolo Baggia
Director of International Standards

March 6th, 2009

Google TechTalk



Overview

- A Bit of History
- W3C Speech Interface Framework Today
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions
- The Near Future
Company Profile

- Privately held company (fully owned by Telecom Italia), founded in 2001 as a
  spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and
  expertise in voice processing.
- Global company, leader in Europe and South America for award-winning, high
  quality voice technologies (synthesis, recognition, authentication and
  identification), available in 26 languages and 62 voices.
- Multilingual, proprietary technologies protected by over 100 patents worldwide.
- Financially robust: break-even reached in 2004, revenues and earnings growing
  year on year.
- Growth-plan investment approved for the evolution of products and services.
- Headquarters in Torino; offices in New York; local representative sales
  offices in Rome, Madrid, Paris, London, Munich.
- Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.
International Awards

- "2008 Frost & Sullivan European Telematics and Infotainment Emerging Company
  of the Year" Award
- Winner of "Market Leader - Best Speech Engine", Speech Industry Award 2007
  and 2008
- Loquendo MRCP Server: winner of the 2008 IP Contact Center Technology
  Pioneer Award
- "Best Innovation in Automotive Speech Synthesis" Prize, AVIOS-SpeechTEK West 2007
- "Best Innovation in Expressive Speech Synthesis" Prize, AVIOS-SpeechTEK West 2006
- "Best Innovation in Multi-Lingual Speech Synthesis" Prize, AVIOS-SpeechTEK West 2005


A Bit of History



Standard Bodies

Two main standard bodies:

- W3C – World Wide Web Consortium
  Founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full
  potential. Staff based at MIT (USA), ERCIM (France), and Keio Univ. (Japan).
  About 400 members all over the world; 50 Working, Interest and Coordination
  Groups. W3C is where the framework of today's Web is developed (HTML, CSS,
  XML, DOM, SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization,
  Web Accessibility, Device Independence).
- IETF – Internet Engineering Task Force
  Founded in 1986; since the early 1990s operating under the Internet Society.
  About 1300 members. Home of HTTP, SIP, RTP and many other protocols. The
  Media Resource Control Protocol (MRCP) is very relevant for speech platforms.

Two industrial forums:

- VoiceXML Forum (www.voicexml.org)
  Inventors of VoiceXML 1.0, then submitted to W3C for standardization. Its
  current goal is to promote, disseminate and support VoiceXML and related
  standards.
- SALT Forum (www.saltforum.org)
  Supported by Microsoft to define a lightweight markup for telephony and
  multimodal applications.

Other relevant bodies: 3GPP, OMA, ETSI, NIST


The (r)evolution of VoiceXML
1998 - 2004

[Timeline figure, reconstructed:]

- 1998: W3C Voice Browser Workshop
- 1999: VoiceXML Forum born, by AT&T, IBM, Lucent, Motorola; W3C charters the
  Voice Browser WG
- 2000: VoiceXML 1.0 released
- 2002: SALT Forum born, by Cisco, Comverse, Intel, Microsoft, Philips,
  SpeechWorks; W3C charters the Multimodal Interaction WG
- 2004: VoiceXML 2.0, SRGS 1.0 and SSML 1.0 become W3C Recs
- 2007: VoiceXML 2.1 and SISR 1.0 become W3C Recs
- 2008: PLS 1.0 becomes a W3C Rec
- 2009: EMMA 1.0 becomes a W3C Rec

Preparing to announce VoiceXML 1.0
Friday Feb. 25th, 2000 – Lucent, Naperville, Illinois

Left to right: Gerald Karam (AT&T), Linda Boyer (IBM), Ken Rehor (Lucent),
Bruce Lucas (IBM), Pete Danielsen (Lucent), Jim Ferrans (Motorola),
Dave Ladd (Motorola).


Speech Interface Framework in 2000
(by Jim Larson)

[Figure: the W3C Speech Interface Framework. The user speaks over a telephone
system; ASR (driven by the Speech Recognition Grammar Spec. SRGS, Semantic
Interpretation for Speech Recognition SISR, and an N-gram Grammar ML) and a
DTMF tone recognizer feed Language Understanding and Context Interpretation
into the Dialog Manager, which connects to the World Wide Web and to the Call
Control XML (CCXML) component. Output flows through Media Planning and
Language Generation to TTS (Speech Synthesis Markup Language, SSML) and a
pre-recorded audio player; the Pronunciation Lexicon Specification (PLS)
serves both ASR and TTS. Dialogs are authored in VoiceXML 2.0/2.1; results
are represented in EMMA.]


Speech Interface Framework - Today
(by Jim Larson)

[Figure: the same Speech Interface Framework diagram, showing the status of
each specification today, with EMMA 1.0 now a Recommendation.]


Speech Interface Framework - End of 2009
(by Jim Larson)

[Figure: the same Speech Interface Framework diagram, showing the expected
status of each specification by the end of 2009.]


W3C Process

[Figure: the W3C Recommendation track process.]


Architectural Changes

Traditional (proprietary) architecture:

  User <-> [ASR/DTMF, TTS/Audio] Proprietary platform <-> Proprietary
  application, built with a proprietary Service Creation Environment (SCE)

VoiceXML architecture:

  User <-> [ASR/DTMF, TTS/Audio] VoiceXML platform / VoiceXML Browser
  <-> HTTP <-> Web application, exchanging .vxml pages, grammars
  (.grxml/.gram), lexicons (.pls), and prompts (.ssml, .wav/.mp3)


The VoiceXML Impact

VoiceXML changed the landscape of IVRs and speech application creation: from
proprietary to standards-based speech applications.

Before:
- Proprietary platforms (HW & SW)
- Proprietary applications (built with a proprietary SCE)
- Mainly DTMF and pre-recorded prompts
- First attempts to add speech into IVR

After:
- Standard VoiceXML platforms
- Standards for speech technologies
- Standard tools for VoiceXML applications
- Integration of DTMF and ASR
- Still a predominance of DTMF, but more and more speech applications


Overview

- A Bit of History ✓
- W3C Speech Interface Framework Today
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions
- The Near Future
Standards for ASR and DTMF
SRGS 1.0, SISR 1.0



W3C Standards for Speech/DTMF Grammars

SYNTAX – SRGS: defines constraints on the sentences admissible for a specific
recognition turn. Two notations (ABNF and XML), two modes (voice and dtmf).
http://www.w3.org/TR/speech-grammar/

SEMANTICS – SISR: describes how to produce results after an utterance is
recognized. Two flavors: literal and script.
http://www.w3.org/TR/semantic-interpretation/


SRGS/SISR Grammars for "Torino"

SISR literal – SRGS XML:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals">
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>10100</tag>
  </rule>
</grammar>

SISR literal – SRGS ABNF:

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0-literals>;
public $main = Torino {10100};

SISR script – SRGS XML:

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0">
  <tag>var unused=7;</tag>
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>out="10100";</tag>
  </rule>
</grammar>

SISR script – SRGS ABNF:

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0>;
{var unused=7;};
public $main = Torino {out="10100";};


SRGS/SISR Standards – Pros

- Powerful syntax (CFG) and very powerful semantics (ECMAScript)
- DTMF and voice input are transparent to the application
- Wide and consistent adoption among technology vendors
- The two syntaxes, XML and ABNF, are great:
  - Developers can choose (XML validation vs. compact format)
  - Transformations are possible:
    - XML to ABNF (easy, a simple XSLT)
    - ABNF to XML (requires an ABNF parser)
- Open-source tools might be created to:
  - Validate grammar syntax
  - Transform grammars
  - Debug grammars on written input
  - Run coverage tests: explode covered sentences, GenSem, SemTester, etc.


SRGS/SISR Standards – Small Issues

- Semantics declaration: the tag-format attribute
  - Value "semantics/1.0"? Mandates SISR script semantics inside semantic tags
  - Value "semantics/1.0-literals"? Mandates SISR literal semantics inside
    semantic tags
  - Missing? Unclear! Risk of interoperability troubles
- SISR script semantics
  - Clumsy default assignment: returns the last referenced rule only
  - The developer must properly pop results up (see the sketch below)
  - Be careful when redefining "out": assigning a scalar value may result in
    errors
- SISR literal semantics
  - Only useful for very simple word-list rules
  - No support for encapsulating rules
  - Use SISR literal grammars as external references ONLY!
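A minimal sketch of the "pop results up" point (rule names and values are
hypothetical): with script semantics, an outer rule must explicitly copy a
subrule's result into its own "out", otherwise only the last referenced
rule's value survives.

<rule id="trip" scope="public">
  from <ruleref uri="#city"/>
  <!-- copy the subrule result up explicitly -->
  <tag>out.origin = rules.city;</tag>
</rule>

<rule id="city">
  <one-of>
    <item>Torino <tag>out = "TRN";</tag></item>
    <item>Boston <tag>out = "BOS";</tag></item>
  </one-of>
</rule>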


SRGS/SISR – Encapsulated Grammars

[Figure: a script-semantics grammar (Gr1.grxml) referencing external grammars
of both kinds – literal (Gr2.gram, Gr41.grxml) and script (Gr3.grxml,
Gr42.gram) – illustrating that literal grammars are safe only as external
references.]


SRGS/SISR Standards – Rich XML Results

Section 7 of the SISR 1.0 specification,
http://www.w3.org/TR/semantic-interpretation/#SI7, defines the serialization
rules from SISR ECMAScript results into XML. Edge cases include arrays, the
special variables "_attributes" and "_value", and the creation of namespaces
and prefixes.

ECMAScript result:

{
  drink: {
    _nsdecl: {
      _prefix: "n1",
      _name: "http://www.example.com/n1"
    },
    _nsprefix: "n1",
    liquid: {
      _nsdecl: {
        _prefix: "n2",
        _name: "http://www.example.com/n2"
      },
      _attributes: {
        color: {
          _nsprefix: "n2",
          _value: "black"
        }
      },
      _value: "coke"
    },
    size: "medium"
  }
}

XML serialization:

<n1:drink xmlns:n1="http://www.example.com/n1">
  <liquid n2:color="black"
          xmlns:n2="http://www.example.com/n2">coke</liquid>
  <size>medium</size>
</n1:drink>


SRGS/SISR Standards – Next Steps

- Adoption of the PLS 1.0 lexicon
  - Clear entry point into PLS lexicons: the <token> element
  - Missing: a role attribute on <token> to allow homograph disambiguation
- Next extensions via errata
  - XML 1.1 support and Implementation Report
  - Updated normative references
- No major extensions are needed!
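For comparison, PLS 1.0 already has a role attribute on <lexeme>; a matching
role on the SRGS <token> would let a grammar pick the intended entry. A
hedged sketch (the POS namespace and role values are hypothetical):

<lexeme xmlns:mypos="http://www.example.com/pos" role="mypos:noun-music">
  <grapheme>bass</grapheme>
  <phoneme>beɪs</phoneme>
</lexeme>
<lexeme xmlns:mypos="http://www.example.com/pos" role="mypos:noun-fish">
  <grapheme>bass</grapheme>
  <phoneme>bæs</phoneme>
</lexeme>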


Speech Synthesis
SSML 1.0/1.1



TTS – Functional Architecture and Markup/Non-Markup Support

Pipeline: Structure Analysis → Text Normalization → Text-to-Phoneme
Conversion → Prosody Analysis → Waveform Production

- Structure analysis. Markup: <p>, <s>. Non-markup: infer the structure by
  automatic text analysis.
- Text normalization. Markup: <say-as> for dates, times, phone numbers and
  numbers; <sub> for acronyms and transliterations. Non-markup: automatically
  identify and convert such constructs.
- Text-to-phoneme conversion. Markup: <phoneme>, <lexicon>. Non-markup: look
  up in a pronunciation dictionary.
- Prosody analysis. Markup: <emphasis>, <break>, <prosody>. Non-markup:
  automatically generate prosody through analysis of document structure and
  sentence syntax.
- Waveform production. Markup: <voice>, <audio>.

http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language Description (I)

Document structure: the <speak> root element carries the version attribute,
the SSML namespace, and the language (xml:lang).

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
</speak>

Processing and pronunciation:
- <p> and <s> (paragraph and sentence): give a structure to the text
- <say-as>: indicates the type of text construct contained within the
  element, e.g. dates, numbers, etc.
- <phoneme>: provides a phonetic pronunciation for the contained text, in IPA
- <sub>: provides substitutions, e.g. for expanding acronyms into a sequence
  of words

http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language Description (II)

Style:
- <voice> element

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  The moon is rising on the beach, when John says,
  looking Mary in the eyes:
  <voice name="simon">I love you!</voice>
  but she suddenly replies:
  <voice name="susan">Please, be serious!</voice>
</speak>

  The voice selection attributes are: name, xml:lang, gender, age, and
  variant.

- <emphasis> element
  Requests that the contained text be spoken with emphasis. The level
  attribute can be set to strong, moderate, reduced, or none.
- <break> element
  Controls the pausing between words. The time attribute takes time
  expressions ("5s", "20ms"); the strength attribute takes the values none,
  x-weak, weak, medium (the default), strong, or x-strong.

http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language Description (III)

Prosody:
- <prosody> element
  Permits control of the pitch, speaking rate and volume of the speech
  output. Its attributes are:
  - volume: the volume for the contained text
  - rate: the speaking rate in words per minute for the contained text
  - duration: a value in seconds or milliseconds for the desired time to take
    to read the element contents
  - pitch: the baseline pitch for the contained text
  - range: the pitch range (variability) for the contained text, in Hertz
  - contour: sets the actual pitch contour for the contained text

Other elements:
- <audio>: plays an audio file
- <mark>: places a marker into the text/tag sequence
- <desc>: provides a description of a non-speech audio source in <audio>

http://www.w3.org/TR/speech-synthesis/
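A minimal sketch combining these elements (the attribute values and audio URI
are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your flight leaves at
  <emphasis level="strong">seven thirty</emphasis>
  <break time="300ms"/>
  <prosody rate="slow" pitch="+10%">from gate twelve.</prosody>
  <audio src="http://www.example.com/chime.wav">
    <desc>confirmation chime</desc>
    beep
  </audio>
</speak>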
Towards SSML 1.1 – Motivations

Internationalization needs:
- Three workshops: Beijing (Nov '05), Crete (May '06), Hyderabad (Jan '07)
- Results:
  - No major needs for Eastern and Western European languages
  - Many issues for Far East languages (Mandarin, Japanese, Korean)
  - Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and
    many Indian languages:
    - Mark input with or without vowels
    - Mark the transliteration scheme used for input

Extensions required by the Voice Browser WG:
- More powerful error handling, selection of fall-back strategies
- Trimming attributes
- The volume attribute now adopts a logarithmic scale (before it was linear)

Alignment with the PLS 1.0 specification for user lexicons.

http://www.w3.org/TR/speech-synthesis11/
SSML 1.1 – Language Changes

- <w> element (word/token boundaries)
- Lexicon extensions: the <lookup> element scopes the application of a
  referenced lexicon to the contained content
- Phonetic Alphabet Registry creation and adoption:
  - "ipa" for the International Phonetic Alphabet
  - A registration policy for other phonetic alphabets, similar to LTRU for
    language tags
  - Candidates:
    - Pinyin for Mandarin Chinese
    - JEITA for Japanese
    - X-SAMPA, an ASCII transliteration of IPA codes

http://www.w3.org/TR/speech-synthesis11/
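A sketch of the new lexicon scoping, assuming an SSML 1.1 processor (the
lexicon URI is illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/names.pls" xml:id="names"/>
  <lookup ref="names">Benigni directed the movie.</lookup>
  Outside the lookup, default pronunciations apply.
</speak>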
Pronunciation Lexicon
PLS 1.0



Pronunciation Lexicons

- Pronunciation lexicon: a mapping between words (or short phrases), their
  written representations, and their pronunciations, suitable for use by an
  ASR engine or a TTS engine.
- Pronunciation lexicons are not only useful for voice browsers. They have
  also proven to be effective mechanisms to support accessibility for the
  differently abled, as well as greater usability for all users. They are
  used to good effect in screen readers and in user agents supporting
  multimodal interfaces.
- The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is designed
  to enable the interoperable specification of pronunciation lexicons.

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Language Overview

- A PLS document is a container (<lexicon>) of several lexical entries
  (<lexeme>).
- Each lexical entry contains:
  - One or more spellings (<grapheme>)
  - One or more pronunciations (<phoneme>) or substitutions (<alias>)
- Each PLS document is related to a single, unique language (xml:lang).
- SSML 1.0 and SRGS 1.0 documents can reference one or more PLS documents.
- The current version doesn't include morphological, syntactic or semantic
  information associated with pronunciations.

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – An Example

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
      http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
    alphabet="ipa" xml:lang="en-US">

  <lexeme>
    <grapheme>Sepulveda</grapheme>
    <phoneme>səˈpʌlvɪdə</phoneme>
  </lexeme>

  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>

</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for TTS

SSML 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" … xml:lang="en-US">
  <lexicon uri="http://www.example.com/SSMLexample.pls"/>
  The title of the movie is: "La vita è bella" (Life is beautiful),
  which is directed by Benigni.
</speak>

PLS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>La vita è bella</grapheme>
    <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Benigni</grapheme>
    <phoneme>bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for ASR

SRGS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
  <lexicon uri="http://www.example.com/SRGSexample.pls"/>
  <rule id="movies" scope="public">
    <one-of>
      <item>Terminator 2: Judgment Day</item>
      <item>Pluto's Judgement Day</item>
    </one-of>
  </rule>
</grammar>

PLS 1.0:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>judgment</grapheme>
    <grapheme>judgement</grapheme>
    <phoneme>ˈdʒʌdʒ.mənt</phoneme>
  </lexeme>
</lexicon>

http://www.w3.org/TR/pronunciation-lexicon/
Examples of Use

- Multiple pronunciations for the same orthography
- Multiple orthographies
- Homophones
- Homographs
- Acronyms, abbreviations, etc.

Detailed descriptions can be found in the W3C specification, on Wikipedia,
and in Paolo Baggia's talks at SpeechTEK 2008 & Voice Search 2009.


PLS 1.0 – Open Issues

- No wide support for IPA in speech engines
  - Changes are slowly under way
  - The Phonetic Alphabet Registry will open the door to other alphabets in a
    controlled and interoperable way
- Integration in ASR/TTS
  - SSML 1.1 will interoperate with PLS 1.0
  - SRGS 1.0 is still missing support for the role attribute for PLS 1.0
- No matching algorithm inside PLS, because it is mainly a data format

http://www.w3.org/TR/pronunciation-lexicon/
Pronunciation Alphabets
IPA, SAMPA



International Phonetic Alphabet

Pronunciation is represented by a phonetic alphabet:
- Standard phonetic alphabets
  - International Phonetic Alphabet (IPA)
- Well-known phonetic alphabets
  - SAMPA – ASCII based (simple to write)
  - Pinyin (Mandarin Chinese), JEITA (Japanese), etc.
- Proprietary phonetic alphabets

International Phonetic Alphabet (IPA):
- Created by the International Phonetic Association (active since 1886), a
  collaborative effort by all the major phoneticians around the world
- A universally agreed system of notation for the sounds of languages
- Covers all languages
- Requires Unicode to write it
- Normatively referenced by PLS


IPA – Chart

- The IPA was founded in 1886 and is the major international association of
  phoneticians.
- The IPA alphabet provides symbols making possible the phonemic
  transcription of all known languages.
- IPA characters can be encoded in Unicode by supplementing ASCII with
  characters from other ranges, particularly:
  - IPA Extensions (0250–02AF)
  - Latin Extended-A (0100–017F)
- See the detailed charts: http://www.unicode.org/charts


Phonetic Alphabets – Issues

- The real problem is how to write pronunciations reliably, unless you are a
  trained phonetician.
- There are issues with fonts, authoring and browsers, but Unicode fonts
  today support the IPA extensions, see:
  http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm
- There are very few tools to help write pronunciations and to let you listen
  to what you have written.
- Goal: make pronunciations available in IPA or other general phonetic
  alphabets.


Voice Dialog languages:
VoiceXML 2.0
VoiceXML 2.1



VoiceXML 2.0 – Features, Elements

- Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>
- Input
  - Speech recognition: <grammar>
  - Recording: <record>
  - Keypad: <grammar mode="dtmf">
- Output
  - Audio files: <audio>
  - Text-To-Speech: <prompt>
- Variables (ECMA-262): <var>, <assign>, <script>, with scoping rules
- Events: <nomatch>, <noinput>, <help>, <catch>, <throw>
- Transition and submission: <goto>, <submit>
- Telephony
  - Connection control: <transfer>, <disconnect>
  - Telephony information
- Platform specifics: <object>
- Performance: fetch and caching properties

http://www.w3.org/TR/voicexml20/
VoiceXML 2.0 – Execution Model

- Execution is synchronous
  - Only the disconnect event is handled (somewhat) asynchronously
- Execution is always in a single dialog: <form> or <menu>
  - The Form Interpretation Algorithm (FIA) drives <field> selection
- Prompts are queued
  - Played only when a waiting state is encountered
  - Played before a fetchaudio is started
- Processing is always in one of two states:
  - Waiting for input in an input item: <field>, <record>, <transfer>, etc.
  - Transitioning between input items in response to an input
- Event-driven:
  - <nomatch>, <noinput>: user input event handling
  - <catch>, <throw>: generalized event mechanism
  - connection.*: call event handling
  - error.*: error event handling

http://www.w3.org/TR/voicexml20/
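A minimal sketch of a form driven by the FIA (the grammar URI and submit
target are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="booking">
    <field name="dest">
      <prompt>Which city do you want to fly to?</prompt>
      <grammar src="cities.grxml" type="application/srgs+xml"/>
      <noinput>Please say a city name.</noinput>
      <nomatch>Sorry, I did not understand.</nomatch>
      <filled>
        <!-- the FIA leaves the form via this transition -->
        <submit next="http://www.example.com/book" namelist="dest"/>
      </filled>
    </field>
  </form>
</vxml>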
VoiceXML 2.1 – Extended Features

- Dynamically reference grammars and scripts:
  <grammar expr="…">, <script expr="…">
- Record the user's utterance during form filling:
  - recordutterance property
  - New shadow variables: recording, recordingsize, recordingduration
- Detect barge-in during prompt playback (SSML <mark>):
  - markexpr attribute
  - New shadow variables: markname and marktime
- Fetch XML data without a transition: <data>
  - Uses a read-only subset of the DOM
- Dynamically concatenate prompts: <foreach>
  - Iterates through ECMAScript arrays and executes content
- Send data upon disconnect: <disconnect namelist="…">
- Additional transfer type: <transfer type="consultation">

http://www.w3.org/TR/voicexml21/
VoiceXML Applications

- Static VoiceXML applications
  - The VoiceXML page is always the same, and so is the user experience
  - No personalization or customization
- Dynamic VoiceXML applications
  - The user experience is customized:
    - After authentication (PIN)
    - Using the caller-id or SIP-id
  - Data driven
  - Dynamic pages generated at runtime, e.g. by JSP, ASP, etc.

http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
A Drawback of VoiceXML 2.0

- A drawback of VoiceXML is that the transition from one VoiceXML page to
  another is a costly activity:
  - Fetch the new page, if not cached
  - Parse the page
  - Initialize the context, possibly loading and initializing a new
    application root document
  - Load or pre-compile scripts
- Transitions are the only way to return data to the web application (if the
  VoiceXML is dynamic)
- Pages must be created to include dynamic data
- VoiceXML 2.1 addresses part of this drawback by feeding dynamic data to a
  running VoiceXML page

http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
Advantages of VoiceXML 2.1 – AJAX

- Two of the eight new features in VoiceXML 2.1 help create more dynamic
  VoiceXML applications:
  - the <data> element
  - the <foreach> element
- A static VoiceXML document can fetch user-specific data at runtime, without
  changing the VoiceXML document:
  - <data> allows retrieval of arbitrary XML data without VoiceXML document
    transitions; the returned XML data is accessible through a subset of DOM
    primitives
  - <foreach> extends prompts to allow iteration over a dynamic array of
    information, creating a dynamic prompt
- This is similar to AJAX programming for HTML services: it decouples the
  presentation layer (VoiceXML) from the business logic (accessed via <data>)

http://www.w3.org/TR/voicexml21/
VoiceXML 2.1 – <data> Element

- Attributes:
  - name: the variable to be filled with the DOM of the retrieved data
  - src or srcexpr: the URI of the location of the XML data
  - namelist: the list of variables to be submitted
  - method: either 'get' or 'post'
  - enctype: media encoding
  - fetch and caching attributes
- Like <var>, it may appear in executable content and as a child of <form>
  and <vxml>
- The value of name must be a declared variable
- The platform fills the variable with the DOM of the fetched XML data
- <data> is synchronous (the service stops to get the data)

http://www.w3.org/TR/voicexml21/
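A hedged sketch of <data> in use (the URI and the shape of the returned XML,
e.g. <balance>42</balance>, are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="account">
    <var name="balance"/>
    <block>
      <!-- fill the declared variable with the DOM of the fetched XML -->
      <data name="balance" src="http://www.example.com/balance.xml"/>
      <prompt>
        Your balance is
        <value expr="balance.documentElement.firstChild.data"/> euros.
      </prompt>
    </block>
  </form>
</vxml>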
VoiceXML 2.1 – <foreach> Element

- Attributes:
  - array: an ECMAScript expression that must evaluate to an ECMAScript array
  - item: the variable that stores the element currently being processed
- <foreach> allows the application to iterate over an ECMAScript array and to
  execute the content
- <foreach> may appear:
  - In executable content (all executable content elements may appear as
    content of <foreach>)
  - In <prompt> (with restrictions on the content)
- <foreach> allows sophisticated concatenation of prompts

http://www.w3.org/TR/voicexml21/
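A sketch of prompt concatenation with <foreach>, assuming an ECMAScript array
built earlier in the page (variable names are illustrative):

<script>
  var flights = [ {city: "Boston", time: "7:30 AM"},
                  {city: "Denver", time: "9:05 AM"} ];
</script>
<prompt>
  You have the following flights:
  <foreach item="f" array="flights">
    to <value expr="f.city"/> at <value expr="f.time"/>.
    <break time="200ms"/>
  </foreach>
</prompt>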
VoiceXML – Final Remarks

- The landscape for speech application development has changed:
  - Virtually all IVRs today support VoiceXML
  - New options related to VoiceXML:
    - SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie)
    - Large-scale hosting of speech applications (TellMe, Voxeo)
    - Development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.)
  - Further changes may come from CCXML adoption
- … but:
  - Mainly system-driven applications are actually deployed
  - New challenges to incorporate more powerful dialog strategies, such as
    mixed initiative, are under discussion

http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
VoiceXML Resources

- Voice Browser Working Group (spec, FAQ, implementations, resources):
  http://www.w3.org/Voice/
- VoiceXML Forum site (resources, education, interest groups):
  http://www.voicexml.org/
  - VoiceXML Forum Review: http://www.voicexmlreview.org/
    Interesting articles related to VoiceXML and more; example code in the
    sections "First Words" and "Speak & Listen"
- Ken Rehor's World of VoiceXML: http://www.kenrehor.com/voicexml
- Online documentation related to VoiceXML platforms:
  Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie
- Many books on VoiceXML:
  - Jim Larson, "VoiceXML: Introduction to Developing Speech Applications",
    Prentice-Hall, 2002
  - A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002


Call Control:
CCXML 1.0



CCXML 1.0 – Highlights

- Asynchronous event processing
- Acceptance or refusal of an incoming call
- Management of different types of call transfer
- Outbound call activation (interaction with an external entity)
- Use of ECMAScript, adding scripting capabilities to call control
  applications
- VoiceXML modularization
- Conferencing management


CCXML 1.0 – Elements Relationship

[Figure: relationships among the CCXML elements.]


CCXML 1.0 – Incoming Call

The CCXML interpreter catches and processes events (e.g. connection.alerting)
delivered through the event$ object with their payload (name, connectionid,
eventid, …):

<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0">
  […]
  <transition event="connection.alerting">
    […]
  </transition>
  <transition event="connection.disconnected">
    […]
  </transition>
  ....
</ccxml>

http://www.w3.org/TR/ccxml


CCXML 1.0 – connection.alerting Event

- Basic telephony information is retrieved on the alerting event and is
  available in the CCXML document: local URI, remote URI, protocol used,
  redirection info, etc.
- Based on this information, CCXML can accept or refuse the incoming call,
  even before contacting the dialog server.
- Any error occurring during the phone call can be managed by the CCXML
  service (connection.failed, error.connection events).

[Sequence: the Call Control Adapter raises connection.alerting; the CCXML
Interpreter analyzes the event$ content and answers with <accept/> or
<reject/>.]

http://www.w3.org/TR/ccxml


CCXML 1.0 – How to Activate a New Dialog

CCXML actions:
- Receive the alerting event from the Call Control Adapter
- Ask the dialog server to prepare a new dialog
- Wait for the preparation
- If the dialog has been successfully prepared, accept the call
- Ask the dialog server to start the prepared dialog

[Sequence between Call Control Adapter, CCXML Interpreter and VoiceXML
Interpreter: alerting → prepare a new dialog → dialog prepared → call
accepted → connected → start the prepared dialog → dialog started.]


Call Transfer

- CCXML supports call transfer in different modalities: "bridge", "blind",
  "consultation".
- Based on the features of each modality, the CCXML language allows the
  expected interaction with the Call Control Adapter to correctly perform the
  transfer.
- During the different phases of a transfer, CCXML can receive any
  asynchronous event and correctly manage it, interrupting the call if
  requested.

[Sequence: the CCXML Interpreter performs a transfer by exchanging commands
and answers with the Call Control Adapter until the transfer completes.]


External Events

- The CCXML Interpreter Context can receive events from any external entity
  able to use the HTTP protocol.
- Events generated this way must be sent to the CCXML interpreter by an HTTP
  POST command.
- The event is then injected, and:
  - It can be addressed to a new session, whose creation must be requested
  - It can be addressed to an existing session, specifying its ID in the
    request

[Sequence: the external entity sends a basic HTTP event; the CCXML
Interpreter manages it and returns the event management result.]

http://www.w3.org/TR/ccxml


External Event on a New Session: the Outbound Call

- A request arrives at Call Control from an external entity.
- The CCXML service associated with the received event is started, and a set
  of operations between the Call Control Adapter, Call Control and the Dialog
  Server is activated: the outbound call is placed.

[Sequence, after the outbound call request: create a call → connection
progressing → prepare a dialog → prepared → connection connected → start the
prepared dialog.]


External Event on a Session: Dialog Termination Request

- An external entity performs an HTTP POST request towards the CCXML
  Interpreter Context, specifying a sessionid and requesting the termination
  of a particular dialog.
- The CCXML interpreter checks the session id; if it is valid, the
  interpreter injects the received event into the session.
- The CCXML service has a transition on that event and performs the dialog
  termination on a particular dialog identifier.

[Sequence: dialogterminate (dialogid) → dialog.exit; the follow-up, e.g.
disconnect(connId) or a new dialogprepare, depends on the dialog.exit event
management.]


Loading Different CCXML Documents: <fetch> and <goto>

- The <fetch> and <goto> elements are used, respectively, to asynchronously
  fetch the content identified by the attributes of <fetch>, and to go to a
  fetched document once it has been successfully loaded.
- Benefits: modularization, simpler source documents, more readability.

<fetch
   next="'http://../Fetch/doc1.ccxml'"
   type="'application/ccxml+xml'"
   fetchid="result"/>

- Fetching the document raises fetch.done or error.fetch; the first event
  that occurs in a new document is ccxml.loaded. The service can then <goto>
  the new document, or continue to work in the current one.

http://www.w3.org/TR/ccxml


Simple CCXML Document

<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="currentState"/>
  <var name="myDialogId"/>
  <var name="myConnId"/>
  <eventprocessor statevariable="currentState">
    <transition event="connection.alerting">
      <assign name="myConnId" expr="event$.connectionid"/>
      <accept connectionid="event$.connectionid"/>
    </transition>
    <transition event="connection.connected">
      <dialogstart src="'http://www.example.com/flight.vxml'"
                   connectionid="myConnId" dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.started">
      <log expr="'VoiceXML appl is running now'"/>
    </transition>
    <transition event="connection.disconnected">
      <dialogterminate dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.exit">
      <disconnect connectionid="myConnId"/>
    </transition>
    <transition event="*">
      <log expr="'Closing, unexpected: ' + event$.name"/>
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>


CCXML 1.0 – Next Steps

- The CCXML specification is a Last Call Working Draft; all feature requests
  and clarifications have been addressed.
- An Implementation Report test suite is under development.
- The spec is very close to being published as a W3C Candidate
  Recommendation.
- Companies, inside or outside the Working Group, will be invited to submit
  implementation reports on their CCXML platforms.
- After that, the CCXML 1.0 specification can become a Proposed
  Recommendation and then a W3C Recommendation.

http://www.w3.org/TR/ccxml


Speech Interface Framework
Tour Complete!



Speech Interface Framework - End of 2009
(by Jim Larson)

[Figure: the same Speech Interface Framework diagram, recapped, with each
specification at its expected status by the end of 2009.]


Architectural Changes

VoiceXML architecture:

  User <-> [ASR/DTMF, TTS/Audio] VoiceXML platform / VoiceXML Browser
  <-> HTTP <-> Web application, exchanging .vxml pages, grammars
  (.grxml/.gram), lexicons (.pls), and prompts (.ssml, .wav/.mp3)


VoxNauta – Internal Architecture

[Figure: the VoxNauta internal architecture.]


Loquendo MRCP Server/LSS 7.0 Architecture

[Figure: behind a load balancer, RTSP (MRCPv1) and SIP/SDP (MRCPv2)
front-ends feed the RTSP, SIP and SDP parsers of the MRCP v1/v2 server, which
runs on Win32/Linux alongside management components (SNMP, graphic management
console, config files), a logger, and an audio provider API. Results are
returned as NLSML or EMMA. A common TTS & ASR interface connects to the LTTS,
LASR and LASR-SV engines; media flows over RTP.]


IETF MRCP Protocols

- The Media Resource Control Protocols (MRCP) are IETF standards:
  - MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on
    RTSP/RTP
  - MRCPv2 is an Internet Draft,
    http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on
    SIP/RTP, offering new audio recording and speaker verification
    functionalities
- An optimized client-server solution for the large-scale deployment of
  speech technologies in the telephony field, such as call centers, CRM, news
  and email reading, self-service applications, etc.
- Provides a standard interface to speech technologies for all IVR platforms.

For more information read: Dave Burke, "Speech Processing for IP Networks:
Media Resource Control Protocol (MRCP)", Wiley.


VoiceXML in a Call Center

[Figure: calls from the fixed/mobile network reach the PBX (through an
optional voice gateway for non-SIP PBXs); the VOXNAUTA IVR and the ACD route
callers to the operators, backed by Web, CTI and data servers.]


VoiceXML in the IMS Architecture

[Figure: voice from the fixed/mobile network enters through a voice gateway
(TDM protocols on one side, SIP and RTP on the other) to the VOXNAUTA MRF;
over the IP network, the MRF fetches VoiceXML from the application server via
HTTPS.]
Overview

- A Bit of History ✓
- W3C Speech Interface Framework Today ✓
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions
- The Near Future
Modes, Modalities and Technologies

- Speech
- Audio
- Stylus
- Touch
- Accelerometer
- Keyboard/keypad
- Mouse/touchpad
- Camera
- Geolocation
- Handwriting recognition
- Speaker verification
- Signature verification
- Fingerprint identification
- …


Complement and Supplement

Speech:
- Transient
- Linear
- Hands- and eyes-free
- Suffers from noise

Visual:
- Persistent
- Spatial
- Requires the eyes
- Suffers from light conditions

- Multimodality enables choosing among different modalities, or mixing them
- Adaptable to different social and environmental conditions, or to user
  preference


GUI + VUI = MUI (or MMUI)


MMI has an Intrinsic Complexity

[Figure (Deborah Dahl, Voice Search 2009): an Interaction Manager surrounded
by many inputs – speech, text, mouse, handwriting, accelerometer, photograph,
drawing, video, audio recording, fingerprint, face identification,
geolocation, speaker verification, vital signs – grouped into sensor,
identification, user intent and recording functions.]
MMI can Include Many Different Technologies

[Figure (Deborah Dahl, Voice Search 2009): an Interaction Manager connected
to speech recognition, touchscreen, accelerometer, geolocation, keypad,
fingerprint recognition, and handwriting recognition.]


Uniform Representation for MMI

- Getting everything to work together is complicated. One simplification is
  to represent the same information from different modalities in the same
  format.
- We need a common language for representing the same information coming
  from different modalities:
- EMMA (Extensible MultiModal Annotation) 1.0 – a uniform representation for
  multimodal information


[Figure (Deborah Dahl, Voice Search 2009): the same set of modality
components – speech recognition, touchscreen, accelerometer, geolocation,
keypad, fingerprint recognition, handwriting recognition – each now
delivering its results to the Interaction Manager as EMMA documents.]


EMMA Structural Elements

EMMA elements provide containers for application semantics and for multimodal
annotation: emma:emma, emma:interpretation, emma:one-of, emma:group,
emma:sequence, emma:lattice.

<emma:emma …>
  <emma:one-of>
    <emma:interpretation>
      …
    </emma:interpretation>
    <emma:interpretation>
      …
    </emma:interpretation>
  </emma:one-of>
</emma:emma>

http://www.w3.org/TR/emma/
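A sketch of an N-best list carried by <emma:one-of> (the values and
confidences are illustrative):

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.20">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>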
EMMA Annotations

Characteristics and processing of the input, e.g.:

- emma:tokens – tokens of the input
- emma:process – reference to the processing
- emma:no-input – lack of input
- emma:uninterpreted – uninterpretable input
- emma:lang – human language of the input
- emma:signal – reference to the signal
- emma:media-type – media type
- emma:confidence – confidence scores
- emma:source – annotation of the input source
- emma:start, emma:end – timestamps (absolute/relative)
- emma:medium, emma:mode, emma:function – medium, mode, and function of the
  input
- emma:hook – hook

http://www.w3.org/TR/emma/
EMMA 1.0 – Example Travel Application

INPUT: "I want to go from Boston to Denver on March 11"

http://www.w3.org/TR/emma/ (example: Deborah Dahl, Voice Search 2009)


EMMA 1.0 – Same Meaning from Different Modalities

Speech:

<emma:interpretation medium="acoustic" mode="voice" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>

Mouse:

<emma:interpretation medium="tactile" mode="gui" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>

http://www.w3.org/TR/emma/ (example: Deborah Dahl, Voice Search 2009)


EMMA 1.0 – Handwriting Input

<emma:interpretation medium="tactile" mode="ink" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>

http://www.w3.org/TR/emma/ (example: Deborah Dahl, Voice Search 2009)


EMMA 1.0 – Biometrics Input

Photograph:

<emma:emma version="1.0">
  <emma:interpretation id="int1"
      emma:confidence=".75"
      emma:medium="visual"
      emma:mode="photograph"
      emma:verbal="false"
      emma:function="identification">
    <person>12345</person>
    <name>Mary Smith</name>
  </emma:interpretation>
</emma:emma>

Voice:

<emma:emma version="1.0">
  <emma:interpretation id="int1"
      emma:confidence=".80"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="false"
      emma:function="identification">
    <person>12345</person>
    <name>Mary Smith</name>
  </emma:interpretation>
</emma:emma>

http://www.w3.org/TR/emma/ (example: Deborah Dahl, Voice Search 2009)


EMMA 1.0 – Representing Lattices

Speech recognizers, handwriting recognizers and other input processing
components may provide lattice output: a graph encoding a range of possible
recognition results or interpretations.

[Lattice figure: nodes 1–8, covering "flights to {boston|austin} from
{portland|oakland} {today please | tomorrow}".]

(From Michael Johnston, AT&T Research)
http://www.w3.org/TR/emma/
EMMA 1.0 – Representing Lattices

Lattices can be represented using the EMMA elements
<emma:lattice emma:initial="?" emma:final="?"> and
<emma:arc emma:from="?" emma:to="?">:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation>
    <emma:lattice emma:initial="1" emma:final="8">
      <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
      <emma:arc emma:from="2" emma:to="3">to</emma:arc>
      <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
      <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
      <emma:arc emma:from="4" emma:to="5">from</emma:arc>
      <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
      <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
      <emma:arc emma:from="6" emma:to="7">today</emma:arc>
      <emma:arc emma:from="7" emma:to="8">please</emma:arc>
      <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

(From Michael Johnston, AT&T Research)
http://www.w3.org/TR/emma/
EMMA in the Multimodal Framework

[Figure: EMMA's position in the W3C Multimodal Interaction Framework,
http://www.w3.org/TR/mmi-framework]


InkML 1.0 – Digital Ink

- Ink Markup Language (InkML), http://www.w3.org/TR/InkML
- A data format for representing digital ink (pen, stylus, etc.)
- Allows the input and processing of handwriting, gestures, sketches, music,
  etc.
<ink>
<trace>
10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126, 10 140,
13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147, 49 135,
58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191, 93 205
</trace>
<trace>
130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125,
152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200,
150 208, 163 210, 178 208, 192 201, 205 192, 214 180
</trace>
<trace>
227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134,
230 148, 234 162, 235 176, 238 190, 241 204
</trace>
<trace>
282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129,
291 143, 294 157, 294 171, 294 185, 296 199, 300 213
</trace>
<trace>
366 130, 359 143, 354 157, 349 171, 352 185, 359 197,
371 204, 385 205, 398 202, 408 191, 413 177, 413 163,
405 150, 392 143, 378 141, 365 150
</trace>
</ink>

http://www.w3.org/TR/InkML/
InkML 1.0 – Status and Advances

- Rich annotation for ink:
  - Traces, trace formats and trace collections
  - Contextual information
  - Canvases
  - Etc.
- The result of classifying InkML traces may be a semantic representation in
  EMMA 1.0
- Current status is Last Call Working Draft; next will be Candidate
  Recommendation, with the release of an Implementation Report test suite
- Raising interest from major industries

http://www.w3.org/TR/InkML/
MMI Architecture Specification

"Multimodal Architecture and Interfaces", W3C Working Draft,
http://www.w3.org/TR/mmi-arch/

- The Runtime Framework provides the basic infrastructure and controls
  communication among the constituents (Delivery Context Component,
  Interaction Manager, Data Component, Modality Components).
- The Interaction Manager (IM) coordinates the Modality Components (MCs)
  through life-cycle events and contains the shared data (context).
- Communication between the IM and the MCs is event-based, through the
  Modality Component API.

(Ingmar Kliche, SpeechTEK 2008)


MMI Arch – Laboratory Implementation

Implementation of the components using W3C markup languages: an SCXML-based
Runtime Framework / Interaction Manager, an HTML Modality Component for the
GUI, and a VoiceXML Modality Component for the VUI, connected through the
Modality Component API.

http://www.w3.org/TR/mmi-arch/ (Ingmar Kliche, SpeechTEK 2008)


MMI Arch – Laboratory Implementation

- SCXML-based Interaction Manager
- VoiceXML + HTML modality components

[Figure: a server runs the SCXML interpreter with an HTTP I/O processor. The
GUI modality component is an HTML browser client, speaking HTTP + XML (using
AJAX); the voice modality component is a CCXML/VoiceXML server with a
telephony interface to the phone client, speaking HTTP + XML (EMMA).]

http://www.w3.org/TR/mmi-arch/ (Ingmar Kliche, SpeechTEK 2008)


MMI Architecture – Open Issues

- Profiles
- Start-up, registration, delegation in a distributed environment
- Transport of events
- Extensibility of events

http://www.w3.org/TR/mmi-arch/


Emotion in Wikipedia

From the Wikipedia definition:

"An emotion is a mental and physiological state associated with a wide
variety of feelings, thoughts, and behaviours. It is a prime determinant of
the sense of subjective well-being and appears to play a central role in many
human activities. As a result of this generality, the subject has been
explored in many, if not all of the human sciences and art forms. There is
much controversy concerning how emotions are defined and classified."

General goal: make interaction between humans and machines more natural for
the humans. Machines should become able:
- to register human emotions (and related states)
- to convey emotions (and related states)
- to "understand" the emotional relevance of events


Emotional States are Numerous

[Figure (Scherer et al., Univ. Geneva): a circumplex of emotional states laid
out along two axes – active/passive arousal and positive/negative valence,
with a high/low power-control diagonal – ranging from AROUSED, TENSE, ANGRY,
AFRAID and EXCITED through HAPPY, PLEASED, CONTENT, SERENE and RELAXED down
to SAD, DEPRESSED, GLOOMY, BORED, SLEEPY and TIRED, with dozens of finer
states in between.]
HUMAINE Project

- HUMAINE project: a European Network of Excellence
  - Activity: 01/2004 - 12/2007
  - 33 partner institutions from many disciplines
- Today: the HUMAINE Association (since June 2007)
  - 125 members
  - Web site: http://emotion-research.net


Online Speaker Classification

Classification techniques:
- Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA):
  preprocessing steps to reduce the feature vector dimension
- K-nearest neighbor
- Gaussian Mixture Models: model the training data as Gaussian densities
- Artificial Neural Networks (ANN), e.g. MLP: interesting training algorithms
- Support Vector Machines (SVM): use "kernel functions" to separate
  non-linear decision boundaries
- Classification and Regression Trees (CART)
- Hidden Markov Models (HMMs): used to model temporal structure

(Felix Burkhardt, Colloquium Hochschule Zittau/Görlitz, 4.8.2008)


Expressive TTS – Two Approaches

1. Different speech-style databases, one for each expressive style: the input
   text plus expressive tags drives style selection over the databases
   (style 1, style 2, … style n) to produce the waveform.
   An effective solution, but feasible only for a very limited range of
   emotions.

2. Speech signal manipulation according to style-dependent prosodic models:
   the input text plus expressive tags drives prosodic-model selection and
   signal processing over a neutral-style database.
   A flexible solution, but it requires accurate models and effective signal
   processing capabilities.

(From Enrico Zovato, Loquendo)


Expressive TTS – Example Prosodic Patterns

Synthesis of two basic emotional styles through prosodic modification:
- different intonation contours
- different acoustic-unit durations

[Plot: F0 contours (0-500 Hz over 1.8 s) for POS ("happy") vs. NEG ("sad")
renderings, Male-UK and Female-UK voices.]

(From Enrico Zovato, Loquendo)
Emotions in ECAs

[Figure: expressive Embodied Conversational Agents; from Piero Cosi, CNR,
Padova.]


W3C Emotion Incubator

"The W3C Incubator Activity fosters rapid development, on a time scale of a
year or less, of new Web-related concepts. Target concepts include innovative
ideas for specifications, guidelines, and applications that are not (or not
yet) clear candidates as Web standards developed through the more thorough
process afforded by the W3C Recommendation Track."

W3C Emotion Incubator aims:
- First charter XG (2006-2007):
  "...to investigate the prospects of defining a general-purpose Emotion
  annotation and representation language...", "...which should be usable in a
  large variety of technological contexts where emotions need to be
  represented."
- Second charter XG (Nov. 2007 - Nov. 2008):
  - Prioritize the requirements
  - Release a first specification draft
  - Illustrate how to combine the Emotion Markup Language with existing
    markup languages


W3C Emotion Incubator – Members

Chairman: Marc Schröder, DFKI

W3C Members: DFKI, Loquendo, Deutsche Telekom, SRI International, NTUA,
Fraunhofer, Chinese Acad. of Science.
Invited Experts: Emotion AI, Univ. Paris 8, Univ. of the Basque Country,
Univ. College Cork, OFAI (Austria), IPCA (Portugal), Tech. Univ. Munich.

Web space: http://www.w3.org/2005/Incubator/emotion

Results:
- Use case description document
- Requirements document
- Final Report (20 Nov 2008): Elements of an EmotionML 1.0,
  http://www.w3.org/2005/Incubator/emotion/XGR-emotionml/
W3C Emotion Incubator – EmotionML 1.0

- Document structure:
  container element (<emotionml>), single emotion annotation (<emotion>)
- Representation of emotions:
  <category>, <dimensions>, <appraisals>, <action-tendency>, <intensity>
- Meta information:
  confidence attribute, <modality> element, <metadata> element
- Links and time:
  <link> element, <timing> element
- Scale values:
  value attribute, <traces> element


EmotionML 1.0 – Examples

Expression of emotions in SSML 1.1:

<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:emo="http://www.w3.org/2008/11/emotionml"
       xml:lang="en-US">
  <s>
    <emo:emotion>
      <emo:category set="everydayEmotions" name="doubt"/>
      <emo:intensity value="0.4"/>
    </emo:emotion>
    Do you need help?
  </s>
</speak>

Detection of emotions in EMMA 1.0:

<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
           xmlns="http://www.w3.org/2008/11/emotionml">
  <emma:interpretation start="12457990" end="12457995" mode="voice"
                       verbal="false">
    <emotion>
      <intensity value="0.1" confidence="0.8"/>
      <category set="everydayEmotions" name="boredom" confidence="0.1"/>
    </emotion>
  </emma:interpretation>
</emma:emma>


Overview

- A Bit of History ✓
- W3C Speech Interface Framework Today ✓
  - ASR/DTMF
  - TTS
  - Lexicons
  - Voice Dialog and Call Control
  - Voice Platforms and Next Evolutions
- W3C Multimodal Interaction Today ✓
  - MMI Architecture
  - EMMA and InkML
  - A Language for Emotions
- The Near Future
W3C VBWG/MMIWG – The Near Future

Specs for the next generation of voice browsing:
- SCXML 1.0
- VoiceXML 3.0


State Charts - SCXML

- State Chart XML (SCXML):
  http://www.w3.org/TR/2008/WD-scxml-20080516/
  - A powerful state-machine language
  - Based on David Harel's statecharts (see his book)
  - Adopted in UML
  - Standard under development by the W3C VBWG: http://www.w3.org/TR/scxml/
- States, transitions, events
  - A data model extends the basic finite state automaton
  - Conditions on transitions
- Nested states
  - Represent task decomposition
  - The machine can be in multiple dependent states at the same time
- Parallel states
  - Represent fork/join logic
- Wide interest:
  - VBWG, MMI WG, other W3C groups, universities, industries
  - Open-source implementations are already available
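A minimal sketch of an SCXML machine (the state names, event names and data
model variable are hypothetical):

<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
       initial="idle">
  <datamodel>
    <data id="lines" expr="0"/>
  </datamodel>
  <state id="idle">
    <!-- a condition on a transition, evaluated against the data model -->
    <transition event="call.incoming" cond="lines &lt; 2" target="talking"/>
  </state>
  <state id="talking">
    <transition event="call.hangup" target="idle"/>
  </state>
</scxml>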


SCXML 1.0 – Parallel State Charts

[Figure: a statechart with parallel regions.]


SCXML as MMI Interaction Manager

[Figure: an SCXML Interaction Manager coordinating a voice modality, a
gesture modality, and a visual modality.]


SCXML for VoiceXML 3.0

[Figure: the same arrangement, with VoiceXML 3.0 providing the voice modality
under SCXML control.]


SCXML 1.0 – Open Issues

- Data model: ECMAScript (ECMA-262) or other formats?
- Definition of profiles
- Other



Re-Thinking VoiceXML – VoiceXML 3.0

- Well-founded: from a syntactic description to a semantic model
- Extensible: SIV (Speaker Identification and Verification), EMMA support,
  rich media, VCR controls, etc.
- Profiled: a light profile (mobile?), a media profile (scalability), a
  VoiceXML 2.1 profile (interoperability), etc.
- Flexible: customization of the FIA (Form Interpretation Algorithm)


VoiceXML 3.0 – Separation of Concerns

- SCXML 1.0: application and interaction logic
- VoiceXML 3.0: voice interaction only, under the control of SCXML
- VoiceXML 3.0 has been published as a First Public Working Draft,
  http://www.w3.org/TR/2008/WD-voicexml30-20081219/
  - Send public comments!


THANK YOU

For clarifications or questions:
paolo.baggia@loquendo.com


THANK YOU

For more information:
- Keep an eye on www.loquendo.com
- Contact: paolo.baggia@loquendo.com
- Keep in touch with Loquendo news: subscribe to the Loquendo Newsletter
- Try our interactive TTS demo: insert your text, choose a language, and
  listen
- The latest news at a click: consult the Loquendo Newsletter online and keep
  up to date on events and initiatives
- For further information, fill in our Contacts Form

Loquendo S.p.A.                     Loquendo S.p.A.
745 Fifth Ave, 27th Floor           Via Olivetti, 6
New York, NY 10151, USA             10148 Torino, Italy
Tel. +1 212.310.9075                Tel. +39 011 291 3111
Fax  +1 212.310.9001                Fax  +39 011 291 3199
www.loquendo.com                    www.loquendo.com