Paolo Baggia
Director of International Standards
Google TechTalk
A Bit of History
Next Future
Google TechTalk – Mar 6th, 2009 Paolo Baggia 2
Company Profile
[Figure: timeline of voice standards, 1998 to 2009]
1998: W3C Voice Browser Workshop
1999: VoiceXML Forum birth (by AT&T, IBM, Lucent, Motorola)
2000: VoiceXML 1.0 released
2002: SALT Forum birth (by Cisco, Comverse, Intel, Microsoft, Philips, SpeechWorks)
2004: VoiceXML 2.0, SRGS 1.0 and SSML 1.0 W3C Rec
2007: SISR 1.0 and VoiceXML 2.1 W3C Rec
2008: PLS 1.0 W3C Rec
2009: EMMA 1.0 W3C Rec
W3C charters the Voice Browser WG and the Multimodal Interaction WG
[Figure: speech interface framework linking ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation and the World Wide Web]
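[Figure: proprietary architecture: User, ASR / DTMF, TTS / Audio and a Speech Applic. built with a proprietary SCE, all on a proprietary platform]
[Figure: VoiceXML architecture: User, ASR / DTMF, TTS / Audio and a VoiceXML Browser on a VoiceXML platform, fetching .vxml, .grxml/.gram and .pls documents over HTTP from a Web Applic.]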
Before
• Proprietary platforms (HW & SW)
• Proprietary applications (by proprietary SCE)
• Mainly DTMF and pre-recorded prompts
• First attempts to add speech into IVR
After
• Standard VoiceXML platforms
• Standards for Speech Technologies
• Standard tools for VoiceXML applications
• Integration of DTMF and ASR
• Still predominance of DTMF, but more and more speech applications
A Bit of History
Next Future
Standards for ASR and DTMF
SRGS 1.0, SISR 1.0
SYNTAX (SRGS): Defines constraints on admissible sentences for a specific recognition turn
SEMANTICS (SISR): Describes how to produce results after an utterance is recognized
[Figure: a speech grammar combines SRGS and SISR]
SRGS forms: ABNF, XML
SRGS modes: voice, dtmf
SISR tag formats: literal, script
http://www.w3.org/TR/speech-grammar/
http://www.w3.org/TR/semantic-interpretation/
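As an illustration of how syntax and semantics fit together, here is a minimal sketch of an SRGS grammar in XML form with SISR script tags. The rule name, the vocabulary and the result property "drink" are hypothetical and do not come from the talk.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- tag-format="semantics/1.0" selects SISR script tags;
     "semantics/1.0-literals" would select literal tags instead -->
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="drink"
         tag-format="semantics/1.0">
  <!-- SRGS part: the admissible sentences for this recognition turn -->
  <rule id="drink" scope="public">
    <one-of>
      <!-- SISR part: each tag fills the rule variable "out",
           which starts as an empty object -->
      <item>coffee <tag>out.drink = "coffee";</tag></item>
      <item>tea <tag>out.drink = "tea";</tag></item>
    </one-of>
  </rule>
</grammar>
```

With this grammar, recognizing "tea" would be expected to yield a result object whose drink property is "tea".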
</grammar>
</grammar>
[Figure: grammars referencing one another across forms and tag formats: Gr1.grxml, Gr2.gram, Gr3.grxml, Gr41.grxml and Gr42.gram, each labelled Literal or Script]
Processing steps: Structure Analysis, Text Normalization, Text-to-Phoneme Conversion, Prosody Analysis, Waveform Production
Text Normalization
Markup support:
<say-as> for date, time, phone number, numbers
<sub> for acronyms and transliterations
Non-Markup support:
automatically identify and convert constructs
Prosody Analysis
Markup support: <emphasis>, <break>, <prosody>
Non-Markup support:
automatically generate prosody through analysis of document structure and sentence syntax
http://www.w3.org/TR/speech-synthesis/
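The text normalization markup can be sketched as follows. The alias text reuses the W3C example from the PLS slide; the interpret-as and format values are assumptions, since SSML 1.0 itself does not define a fixed set of them and support depends on the synthesis processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- <sub> replaces the written form with the text to be spoken -->
  The <sub alias="World Wide Web Consortium">W3C</sub> talk is on
  <!-- <say-as> hints at how to read the content; "date" and "mdy"
       are illustrative values, not guaranteed by SSML 1.0 -->
  <say-as interpret-as="date" format="mdy">3/6/2009</say-as>.
</speak>
```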
SSML 1.0 – Language description (I)
Document Structure
<speak> root element
version attribute
SSML namespace attribute
Languages: set with xml:lang
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
<p>I don't speak Japanese.</p>
<p xml:lang="ja">Nihongo-ga wakarimasen.</p>
</speak>
- <emphasis> element
requests that the contained text be spoken with emphasis
level attribute can set it to strong, moderate, reduced, or none
- <break> element
controls the pausing between words
time attribute with time expressions as values, e.g. “5s”, “20ms”
strength attribute with values:
none, x-weak, weak, medium (default value), strong, or x-strong
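A short sketch of both elements in one SSML document may help; the sentences are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- level can be strong, moderate, reduced or none -->
  That is a <emphasis level="strong">huge</emphasis> bank account.
  <!-- time gives an explicit pause duration -->
  Take a deep breath <break time="3s"/> then continue.
  <!-- strength gives a relative pause instead of a duration -->
  Press one <break strength="weak"/> or wait for the tone.
</speak>
```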
http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language description (III)
Prosody
<prosody> element
permits control of the pitch, speaking rate and volume of the
speech output.
Other elements
<audio> element - to play an audio file
<mark> element - to place a marker into the text/tag sequence
<desc> element - to provide a description of a non-speech audio
source in <audio>
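The elements on this slide can be combined as in the following sketch. The audio URI, the marker name and the spoken text are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- <mark> places a named marker in the text/tag sequence -->
  <mark name="greeting"/>
  <!-- <prosody> controls pitch, speaking rate and volume -->
  <prosody pitch="high" rate="slow" volume="loud">Welcome back.</prosody>
  <!-- <audio> plays a file; <desc> describes the non-speech audio,
       and the remaining content is the fallback if the file cannot be played -->
  <audio src="http://www.example.com/chime.wav">
    <desc>a short chime</desc>
    chime
  </audio>
</speak>
```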
http://www.w3.org/TR/speech-synthesis/
Towards SSML 1.1 – Motivations
Internationalization needs:
Three Workshops: Beijing (Nov’05), Crete (May’06), Hyderabad (Jan’07)
Results:
No major needs for Eastern and Western European languages
Many issues for Far East languages (Mandarin, Japanese, Korean)
Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many
Indian languages
Mark input with or without vowels
Mark the transliteration schema used for input
<w> element
Lexicon extensions
<lookup> element
selects which referenced lexicon applies to the contained text
http://www.w3.org/TR/speech-synthesis/
Pronunciation Lexicon
PLS 1.0
Pronunciation Lexicon
A mapping between words (or short phrases), their written representations,
and their pronunciations suitable for use by an ASR engine or a TTS
engine
http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Language Overview
SSML 1.0 and SRGS 1.0 documents can reference one or more PLS
documents
http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – An Example
<lexeme>
<grapheme>Sepulveda</grapheme>
<phoneme>səˈpʌlvɪdə</phoneme>
</lexeme>
<lexeme>
<grapheme>W3C</grapheme>
<alias>World Wide Web Consortium</alias>
</lexeme>
</lexicon>
http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for TTS
SSML 1.0
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" … xml:lang="en-US">
<lexicon uri="http://www.example.com/SSMLexample.pls"/>
The title of the movie is: "La vita è bella" (Life is beautiful),
which is directed by Benigni.
</speak>
PLS 1.0
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
<lexeme>
<grapheme>La vita è bella</grapheme>
<phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
</lexeme>
<lexeme>
<grapheme>Benigni</grapheme>
<phoneme>bɛˈniːnji</phoneme>
</lexeme>
</lexicon>
http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for ASR
SRGS 1.0
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
<lexicon uri="http://www.example.com/SRGSexample.pls"/>
<rule id="movies" scope="public">
<one-of>
<item>Terminator 2: Judgment Day</item>
<item>Pluto's Judgement Day</item>
</one-of>
</rule>
</grammar>
PLS 1.0
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
<lexeme>
<grapheme>judgment</grapheme>
<grapheme>judgement</grapheme>
<phoneme>ˈdʒʌdʒ.mənt</phoneme>
</lexeme>
</lexicon>
http://www.w3.org/TR/pronunciation-lexicon/
Examples of Use
Multiple orthographies
Homophones
Homographs
Integration in ASR/TTS
SSML 1.1 will interoperate with PLS 1.0
SRGS 1.0 still lacks support for the PLS 1.0 role attribute
http://www.w3.org/TR/pronunciation-lexicon/
Pronunciation Alphabets
IPA, SAMPA
Execution is synchronous
Only the disconnect event is handled (somewhat) asynchronously
Event-driven:
<nomatch>, <noinput> user’s input event handling
<catch>, <throw> generalized event mechanism
connection.* call event handling
error.* error event handling
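The event mechanisms listed above can be sketched in a single VoiceXML 2.0 document. The grammar file name, the prompts and the application-defined event com.example.toomanyerrors are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Generalized mechanism: catches the event thrown further down -->
  <catch event="com.example.toomanyerrors">
    <prompt>Sorry, I am having trouble. Goodbye.</prompt>
    <exit/>
  </catch>
  <!-- Call event handling: the caller hung up -->
  <catch event="connection.disconnect.hangup">
    <exit/>
  </catch>
  <!-- Error event handling: <error> is shorthand for <catch event="error"> -->
  <error>
    <prompt>A technical problem occurred.</prompt>
    <exit/>
  </error>
  <form id="pick">
    <field name="movie">
      <grammar src="movies.grxml" type="application/srgs+xml"/>
      <prompt>Which movie would you like?</prompt>
      <!-- User input events -->
      <noinput>I did not hear you. <reprompt/></noinput>
      <nomatch>I did not understand. <reprompt/></nomatch>
      <!-- On the third nomatch, escalate with an application event -->
      <nomatch count="3">
        <throw event="com.example.toomanyerrors"/>
      </nomatch>
      <filled>
        <prompt>You chose <value expr="movie"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```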
http://www.w3.org/TR/voicexml20/
VoiceXML 2.1 – Extended Features
Dynamically referencing grammars and scripts:
<grammar expr="…">, <script expr="…">
http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
A Drawback of VoiceXML 2.0
The transitions are the only way to return data to the Web Application
(if the VoiceXML is dynamic)
http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
Advantages of VoiceXML 2.1 - AJAX
Attributes:
name: the variable to be filled with the DOM of the retrieved data
src or srcexpr: the URI of the location of the XML data
namelist: the list of variables to be submitted
method: either ‘get’ or ‘post’
enctype: media encoding
fetch and caching attributes
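These attributes match the <data> element of VoiceXML 2.1, which fetches XML without leaving the current document. The sketch below assumes a hypothetical service at www.example.com that returns a document containing a price element; the variable names are illustrative too.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="quote">
    <block>
      <!-- Value submitted to the server through namelist -->
      <var name="symbol" expr="'ACME'"/>
      <!-- Fetch the XML; "result" is filled with its DOM -->
      <data name="result" src="http://www.example.com/quote.xml"
            namelist="symbol" method="get"/>
      <!-- Read the text of the first price element (assumed to exist) -->
      <var name="price"
           expr="result.documentElement.getElementsByTagName('price').item(0).firstChild.data"/>
      <prompt>The price is <value expr="price"/>.</prompt>
    </block>
  </form>
</vxml>
```

The dialog stays in the same document, so no transition is needed to exchange data with the Web application.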
http://www.w3.org/TR/voicexml21/
VoiceXML 2.1 – <foreach> Element
Attributes:
array ECMAScript expression that must evaluate to ECMAScript array
item the variable that stores the element to be processed
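A minimal sketch of <foreach> iterating over an ECMAScript array inside a prompt; the array contents and the variable names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <!-- Document-scope array to iterate over -->
  <script>
    var movies = ['Terminator 2', 'Life is beautiful'];
  </script>
  <form id="list">
    <block>
      <prompt>
        Now showing:
        <!-- array must evaluate to an ECMAScript array;
             item names the variable holding the current element -->
        <foreach item="movie" array="movies">
          <value expr="movie"/> <break/>
        </foreach>
      </prompt>
    </block>
  </form>
</vxml>
```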
http://www.w3.org/TR/voicexml21/
VoiceXML – Final Remarks
… but:
Mainly system driven applications are actually deployed
New challenges to incorporate more powerful dialog strategies,
mixed-initiative are under discussion.
http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
VoiceXML Resources
VoiceXML modularization
Conferencing management
[…]
<transition event="connection.alerting">
[…]
</transition>
<transition event="connection.disconnected">
[…]
</transition>
[Figure: the CCXML Interpreter receives the connection.alerting event; its event$ object carries name:'connection.alerting'; connectionid:'0239023901903993'; eventid:'00001'; ....]
http://www.w3.org/TR/ccxml
Any error that can occur during the phone call can be managed by the CCXML service (connection.failed, error.connection events)
connection.alerting
http://www.w3.org/TR/ccxml
Performing a transfer
[Figure: message sequence for a transfer: command1, answer1, […], transfer complete, event management, event management result]
http://www.w3.org/TR/ccxml
connection progressing …
Prepare a dialog
prepared
connection connected
Start the prepared dialog
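The sequence above (prepare a dialog while the call progresses, start it once connected) might be written as the following CCXML sketch. The destination number and the dialog URI are hypothetical, and the sketch assumes the platform raises connection.progressing and that the dialog is prepared before connection.connected arrives.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <!-- Holds the identifier of the prepared dialog -->
  <var name="dlg"/>
  <eventprocessor>
    <!-- First event in a new document: place an outbound call -->
    <transition event="ccxml.loaded">
      <createcall dest="'tel:+15555550100'"/>
    </transition>
    <!-- Connection progressing: prepare the dialog in advance -->
    <transition event="connection.progressing">
      <dialogprepare src="'welcome.vxml'"
                     type="'application/voicexml+xml'"
                     dialogid="dlg"/>
    </transition>
    <!-- Connection connected: start the prepared dialog -->
    <transition event="connection.connected">
      <dialogstart prepareddialogid="dlg"/>
    </transition>
    <!-- Call ended or could not be set up: stop the session -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
    <transition event="connection.failed">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```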
CCXML Interpreter
- MODULARIZATION
- SOURCE EXEMPLIFICATION
- MORE READABILITY
<fetch
next="'http://../Fetch/doc1.ccxml'"
type="'application/ccxml+xml'"
fetchid="result"/>
Fetch the document "doc1.ccxml": the result is fetch.done or error.fetch
Goto into the new document, or continue to work on the same dialog
The first event raised in a new document is ccxml.loaded
http://www.w3.org/TR/ccxml
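[Figure: speech interface framework (ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, World Wide Web) mapped onto the VoiceXML architecture: User, ASR / DTMF, TTS / Audio and VoiceXML Browser on a VoiceXML platform, fetching .vxml, .grxml/.gram and .pls documents over HTTP from a Web Applic.]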
[Figure: platform architecture: Load Balancer; RTSP (MRCPv1) and SIP (SDP) MRCP v2 interfaces to an MRCP v1/v2 Server; Management (SNMP) with a Graphic Management Configuration Console and config files; Audio Provider API; Logger and log files; OS Win32/Linux; results in NLSML / EMMA]
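[Figure: deployment options: Fixed/Mobile Network connected through a VOICE GATEWAY (TDM protocols, SIP protocols, RTP) to the VOXNAUTA IVR and ACD, with an optional Voice Gateway for non-SIP PBX; Fixed/Mobile Network and IP Network connected to the VOXNAUTA MRF, with VoiceXML on HTTPS from an Application Server]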
Overview
A Bit of History
Next Future
Modes, Modalities and Technologies
Speech
Audio
Stylus
Touch
Accelerometer
Keyboard/keypad
Mouse/touchpad
Camera
Geolocation
Handwriting recognition
Speaker verification
Signature verification
Fingerprint identification
….
Speech
- Transient
- Linear
- Hands and Eyes-Free
- Suffers Noise
Visual
- Persistent
- Spatial
- Eyes
- Suffers Light Conditions
[Figure: an Interaction Manager receiving inputs labelled Sensor, Identification and User intent: speech, text, mouse, handwriting, accelerometer, geolocation, vital signs, fingerprint, face identification, speaker verification; plus recordings: photograph, video, drawing, audio recording]
[Figure: Touchscreen, Accelerometer, Speech recognition, Geolocation, Fingerprint recognition, Keypad and Handwriting recognition components feeding the Interaction Manager]
[Figure: the same components, each sending its results to the Interaction Manager as EMMA]
EMMA Elements
Provide containers for application semantics and for multimodal annotation
emma:emma
http://www.w3.org/TR/emma/
EMMA Annotations
http://www.w3.org/TR/emma/
EMMA 1.0 – Example Travel Application
INPUT:
"I want to go from Boston
to Denver on March 11"
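For this input, an EMMA 1.0 result could look like the sketch below. The application elements (origin, destination, date), the date format and the confidence value are hypothetical; only the emma: elements and annotations come from the EMMA vocabulary.

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- emma:* attributes annotate the input;
       the child elements carry the application semantics -->
  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.75"
      emma:tokens="I want to go from Boston to Denver on March 11">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>03-11</date>
  </emma:interpretation>
</emma:emma>
```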
[Figure: word lattice with nodes 1 to 8 and arcs flights, to, boston/austin, from, portland/oakland, today, please, tomorrow]
<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma">
<emma:interpretation>
<emma:lattice emma:initial="1" emma:final="8">
<emma:arc emma:from="1" emma:to="2">flights</emma:arc>
<emma:arc emma:from="2" emma:to="3">to</emma:arc>
<emma:arc emma:from="3" emma:to="4">boston</emma:arc>
<emma:arc emma:from="3" emma:to="4">austin</emma:arc>
<emma:arc emma:from="4" emma:to="5">from</emma:arc>
<emma:arc emma:from="5" emma:to="6">portland</emma:arc>
<emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
<emma:arc emma:from="6" emma:to="7">today</emma:arc>
<emma:arc emma:from="7" emma:to="8">please</emma:arc>
<emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
</emma:lattice>
</emma:interpretation>
</emma:emma>
From Michael Johnston, AT&T Research
http://www.w3.org/TR/emma/
EMMA in Multimodal Framework
http://www.w3.org/TR/mmi-framework
EMMA
http://www.w3.org/TR/InkML/
InkML 1.0 – Status and Advances
http://www.w3.org/TR/InkML/
MMI Architecture Specification
[Figure: MMI architecture example. Server: SCXML interpreter with an HTTP I/O Processor. Modality Component 1: HTML for GUI, using the Modality Component API over HTTP + XML (using AJAX), rendered by an HTML Browser on the client. Modality Component N: VoiceXML for VUI, using the Modality Component API over HTTP + XML (EMMA), run by a CCXML/VoiceXML Server/Browser with a telephony interface to a Phone Client.]
Profiles
Transport of Events
Extensibility of Events
http://www.w3.org/TR/mmi-arch/
[Figure: emotion terms plotted in a two-dimensional space with axes Active vs. Passive and Obstructive vs. Conducive, with Hi and Lo Power/Control regions; terms include EXCITED, ASTONISHED, DELIGHTED, HAPPY, ANGRY, AFRAID, ALARMED, TENSE, ANNOYED, DISTRESSED, FRUSTRATED, BORED, DROOPY, TIRED and SLEEPY. Scherer et al., Univ. Geneva]
HUMAINE project
European Network of Excellence
Activity: 01/2004 - 12/2007
33 partner institutions from many disciplines
Classification Techniques
- Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA): preprocessing step to reduce feature vector dimension
- K-nearest Neighbor
- Gaussian Mixture Models: model training data as Gaussian densities
- Artificial Neural Networks (ANN), e.g. MLP: interesting training algorithms
- Support Vector Machines (SVM): use “kernel functions” to separate non-linear decision boundaries
- Classification and Regression Trees (CART)
- Hidden Markov Models (HMMs): used to model temporal structure
From Felix Burkhardt, Colloquium Hochschule Zittau/Görlitz, 4.8.2008, page 1.
Text+expressive tags
2. Speech signal manipulation according to style dependent prosodic models
Flexible solution, but requires accurate models and effective signal processing capabilities
[Figure: neutral waveform selection followed by signal processing driven by a prosodic model for the requested style]
[Figure: pitch contours, Frequency (Hz) from 0 to 500 over Time (s) from 0 to 1.8, for POS (“happy”) and NEG (“sad”) renderings; NEG and POS samples for Male-UK and Female-UK voices]
From Enrico Zovato, Loquendo
Emotions in ECAs
Document structure:
container element (<emotionml>), single emotion annotation (<emotion>)
Representation of emotions:
<category> element, <dimensions> element, <appraisals> element,
<action-tendency> element, <intensity> element
Meta information:
confidence attribute, <modality> element, <metadata> element
Scale values
value attribute, <traces> element
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:emo="http://www.w3.org/2008/11/emotionml"
xml:lang="en-US">
<s>
<emo:emotion>
<emo:category set="everydayEmotions" name="doubt"/>
<emo:intensity value="0.4"/>
</emo:emotion>
<emotion>
<intensity value="0.1" confidence="0.8"/>
<category set="everydayEmotions" name="boredom" confidence="0.1"/>
</emotion>
</emma:interpretation>
</emma:emma>
A Bit of History
Next Future
W3C VBWG/MMIWG – Next Future
SCXML 1.0
VoiceXML 3.0
Wide interest:
VBWG, MMI WG, Other W3C groups, Universities, Industries
Already available Open Source Implementations
[Figure: Voice Modality, Gesture Modality and Visual Modality]
Data model:
ECMA Script (ECMA-262) or other formats?
Definition of Profiles
Other
Well-founded:
From syntactic description to a semantic model
Extensible:
SIV, EMMA support, rich media, VCR control, etc.
Profiled:
light profile (mobile?), media profile (scalability), VoiceXML 2.1
profile (interoperability), etc.
Flexibility:
Customization of FIA (Form Interpretation Algorithm)
SCXML 1.0
Application and interaction logic
VoiceXML 3.0:
Voice Interaction only, under control of SCXML
paolo.baggia@loquendo.com