
A Project Report

on

Analysis of Spoken Dialog Systems


Submitted in partial fulfillment of
the requirements for the award of the degree of

Bachelor of Technology
in
Computer Science

Submitted by

Roll No Names of Students

4129441005 Tanpreet Kaur


4129441075 Twinkle
4129441076 Tanya

Under the guidance of


Ms. Ritu Singhal
Ms. Vimla Parihar

Department of Computer Science and Engineering


IP College for Women
University of Delhi
New Delhi
April 2017

CERTIFICATE

This is to certify that the Project entitled Analysis of Spoken Dialog Systems, submitted to the Department of Computer Science, IP College for Women, University of Delhi by Twinkle, Tanpreet and Tanya, is a satisfactory and authentic piece of work that has been carried out under my supervision and guidance. This work has been done in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science. The matter embodied in this Project Report is genuine work done by the students and has not been submitted to this university or any other university/institute in fulfillment of the requirements of any course of study.

Roll No Names of Students

4129441005 Tanpreet Kaur


4129441075 Twinkle
4129441076 Tanya

Ms. Ritu Singhal


Ms. Vimla Parihar
(Project Guides)

Dr. Manju Bala


(Teacher Incharge)

Date:

ABSTRACT

Human-machine spoken dialog differs from written dialog primarily due to the limitations of current speech recognition systems and the intrinsic structure of spoken language dialog. Speech recognition systems have limitations that may be explained by the non-deterministic character of the recognition process, including difficulty in accounting for short and degraded messages. We focus on the problem of modeling and evaluating spoken language systems in the context of human-machine dialogs.
The intrinsic characteristics of spoken dialog include the spontaneity of utterances, which yields a significant amount of redundant information, repetitions, self-corrections, hesitations, contradictions, and even tendencies to interrupt the interlocutor. They also include the non-grammatical structure of human utterances. Finally, they include clarification and/or reformulation sub-dialogs that depend on the limitations of the speech recognizer. This project report describes an ambitious project that embeds human subjects in a spoken dialog system and collects a rich data set including spoken dialog, human behavior, and system features.
This project lays out the analysis of three spoken dialog systems and examines how people manage the problems that arise in dialog under such restrictions.

Contents

1 DEFINITIONS, ACRONYMS AND ABBREVIATIONS

2 INTRODUCTION
2.1 WHY IS THE PARTICULAR TOPIC CHOSEN?
2.2 OBJECTIVE AND SCOPE
2.3 METHODOLOGY
2.4 WHAT CONTRIBUTION WOULD THE PROJECT MAKE?

3 OVERALL DESCRIPTION
3.1 NON-FUNCTIONAL REQUIREMENTS

4 WORK DONE
4.1 Spoken Dialogue for Intelligent Tutoring Systems
4.2 Predictive Performance Modeling
4.3 Monitoring Student State (motivation)
4.4 Cobot: A Software Agent
4.5 An Intelligent Natural Language Conversational System for Academic Advising (INSTAVIS)

5 FUTURE WORK

6 CONCLUSION

7 ACKNOWLEDGMENT

Chapter 1

DEFINITIONS, ACRONYMS
AND ABBREVIATIONS

The following is the list of conventions and acronyms used in this document and in the project (a pipeline sketch follows this list):

• ASR: Automatic Speech Recognition

• NLU: Natural Language Understanding

• DM: Dialog Management

• SDS: Spoken Dialog System

• SPOKEN DIALOG SYSTEM: A spoken dialog system (SDS) is an autonomous agent that communicates with people in the most natural and convenient way through spoken language.
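To make the pipeline behind these terms concrete, the following is a minimal sketch of one turn through the classic SDS pipeline (ASR, NLU, DM, TTS). Every function here is a hypothetical stub for illustration, not the API of any real toolkit.

```python
# Minimal sketch of one spoken-dialog-system turn. Each component
# below is a hypothetical stub standing in for a real module.

def recognize_speech(audio: bytes) -> str:
    """ASR stub: audio in, best text hypothesis out."""
    return "what is the net force on the truck"

def understand(text: str) -> dict:
    """NLU stub: map text to a shallow semantic frame."""
    return {"intent": "ask_question", "topic": "net force"}

def decide(frame: dict, state: dict) -> str:
    """DM stub: pick the next system act given the dialog state."""
    state["turns"] = state.get("turns", 0) + 1
    return f"Let's work out the {frame['topic']} together."

def speak(utterance: str) -> None:
    """TTS stub: render the system utterance as speech."""
    print("SYSTEM:", utterance)

def dialog_turn(audio: bytes, state: dict) -> None:
    # One pass through the pipeline: ASR -> NLU -> DM -> TTS.
    speak(decide(understand(recognize_speech(audio)), state))

state: dict = {}
dialog_turn(b"...", state)  # placeholder audio bytes
```

In a real system each stub is a full component (recognizer, semantic parser, dialog manager, synthesizer), but the per-turn control flow is essentially this loop.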

Chapter 2

INTRODUCTION

The goal of this project is to generate an empirically based understanding of the ramifications of adding spoken language capabilities to text-based dialogue tutors, and to understand how these implications might differ between human-human and human-computer spoken interactions. This research explores the relative effectiveness of three SDSs:
• ITSPOKE: A speech-based dialogue system that uses a text-based system for
tutoring conceptual physics as its “back-end.” The results of this work will
demonstrate whether spoken dialogues yield increased performance compared
to text with respect to a variety of evaluation measures, whether the same or
different student and tutor behaviors correlate with learning gains in speech
and text, and how such findings generalize both across and within human
and computer tutoring conditions. These results will impact the development
of future dialogue tutoring systems incorporating speech, by highlighting the
performance gains that can be expected, and the requirements for achieving
such gains.

• COBOTDS: CobotDS provides real-time, two-way, natural language communication between a phone user and the multiple users in the text environment. We describe a number of the challenging design issues we faced, and our use of summarization, social filtering, and personalized grammars in tackling them. We report a number of empirical findings from a small user study.

• INSTAVIS: Academic advisors assist students in academic, professional, social and personal matters. Successful advising increases student retention, improves graduation rates and helps students meet educational goals. This work presents an advising system that assists advisors in multiple tasks using natural language. This system features a conversational agent as the user interface, an academic advising knowledge base with a method to allow the users to contribute to it, an expert system for academic planning, and a web design structure for the implementation platform. The system is operational for several hundred students from a university department.

2.1 WHY IS THE PARTICULAR TOPIC CHOSEN?
There is increasing interest in building dialogue systems that can detect and adapt
to user affective states. However, while this line of research is promising, there is still
much work to be done. For example, most research has focused on detecting user
affective states, rather than on developing dialogue strategies that adapt to such
states once detected. In addition, when affect-adaptive dialogue systems have been
developed, most systems detect and adapt to only a single user state, and typically
assume that the same affect-adaptive strategy will be equally effective for all users.

2.2 OBJECTIVE AND SCOPE


In this project we take a step towards examining these issues, by presenting an
evaluation of three versions of an affect-adaptive spoken tutorial dialogue system:
one that detects and adapts to both user disengagement and uncertainty, one that
adapts to only disengagement, and one that doesn't adapt to affect at all. We target
disengagement and uncertainty because these were the most frequent affective states
in prior studies with our system and their presence was negatively correlated with
task success. The detection of these and similar states is also of interest to the larger
speech and language processing communities.

2.3 METHODOLOGY
These are the factors of analysis (a language-model sketch follows this list):

• Metaphor

• Language

• Utterance length

• Semantics (e.g., whether simple semantics suffice)

• Syntax (more predictable vs. less predictable constructions)

• Language models

• Language coverage (a key challenge)
2.4 WHAT CONTRIBUTION WOULD THE PROJECT
MAKE?
The overall objective of the project is to support rapid, cost-effective development of speech-enabled dialogue systems. Commercial technology for speech-enabled interfaces has made rapid progress over the past decade. An increasing number of systems deployed in commercial applications provide structured, system-initiated interaction. These systems work by controlling the conversation, requesting that the user provide a specific kind of information at each turn. However, these systems do not yet have true conversational capability. This project will help us in:

• Building robust systems that can engage in true mixed-initiative interaction

• Building conversational systems that can interact naturally with the user

• Supporting both user and system initiative, providing clarification and negotiation

• Providing the ability to recover from user and system errors

• Exploring the issues of mixed-initiative conversational interaction

• Taking further innovative research on dialogue management and interface design to support conversational systems

• Encouraging the transfer of this technology to real users

Chapter 3

OVERALL DESCRIPTION

3.1 NON-FUNCTIONAL REQUIREMENTS

• Error handling: The product shall handle expected and unexpected errors in ways that prevent loss of information and long downtime periods.

• Performance requirements:

  – The system shall accommodate a large number of users without any fault.
  – Responses to view information shall take no longer than 5 seconds to appear on the screen.

• Safety requirements: System use shall not cause any harm to human users.

Chapter 4

WORK DONE

4.1 Spoken Dialogue for Intelligent Tutoring Systems
• Self-explanation correlates with learning and occurs more in speech

– TUTOR: The right side pumps blood to the lungs, and the left side
pumps blood to the other parts of the body. Could you explain how that
works?
– STUDENT 1 (self-explains): So the septum is a divider so that the blood
doesn’t get mixed up. So the right side is to the lungs, and the left side
is to the body. So the septum is like a wall that divides the heart into
two parts...it kind of like separates it so that the blood doesn’t get mixed
up...
– STUDENT 2 (doesn't self-explain): right side pumps blood to lungs

• Speech contains prosodic information, providing new sources of information about the student for dialogue (a feature-extraction sketch follows the example below).

– ITSPOKE: How does his velocity compare to that of his keys?


– STUDENT: his velocity is constant
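As a rough illustration of the kind of prosodic evidence available in the speech signal (and absent from typed input), here is a minimal sketch that computes frame-level energy and a zero-crossing voicing proxy from raw samples using numpy only. This is an assumption-laden toy, not ITSPOKE's actual feature pipeline.

```python
# Toy prosodic feature extraction: per-frame RMS energy and
# zero-crossing rate from a mono waveform; just the kind of
# signal-level evidence speech adds over text.
import numpy as np

def prosodic_features(samples: np.ndarray, sr: int, frame_ms: int = 25):
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sqrt((frames ** 2).mean(axis=1))  # loudness proxy
    # fraction of sign changes per frame: crude pitch/voicing proxy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return energy, zcr

# Usage with one second of synthetic audio at 16 kHz:
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
samples = 0.1 * np.sin(2 * np.pi * 220 * t)  # stand-in for a student turn
energy, zcr = prosodic_features(samples, sr)
print(energy.mean(), zcr.mean())
```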

4.2 Predictive Performance Modeling
Opportunity:
Spoken dialogue system evaluation methodologies can improve our understanding of how dialogue facilitates student learning.

• Measure parameters (interaction costs and benefits) and performance in a system corpus

• Train a model via multiple linear regression over the parameters, predicting performance: System Performance = Σ_i (w_i × p_i) (a sketch of this fit follows this list)

• Test the model on a new corpus

• Predict performance during future system design
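The following is a minimal sketch of that regression step in the spirit of PARADISE-style evaluation: fit weights w_i over interaction parameters p_i to predict a performance measure. The parameter names and all numbers are fabricated for illustration.

```python
# Minimal sketch of performance modeling:
# System Performance = sum_i w_i * p_i, fit by least squares.
# All data below is fabricated for illustration.
import numpy as np

# One row per dialogue: [ASR rejection rate, % uncertain turns, mean turn length]
params = np.array([
    [0.10, 0.20, 5.1],
    [0.05, 0.35, 4.2],
    [0.20, 0.10, 6.0],
    [0.15, 0.25, 5.5],
])
performance = np.array([0.62, 0.48, 0.70, 0.55])  # e.g., normalized learning gain

# Add an intercept column and solve for the weights w.
X = np.column_stack([params, np.ones(len(params))])
w, *_ = np.linalg.lstsq(X, performance, rcond=None)

# Predict performance for a new system design's expected parameters.
new_design = np.array([0.08, 0.30, 4.8, 1.0])
print("predicted performance:", new_design @ w)
```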

Challenges:

• How do we measure system performance?

• What are predictive interaction parameters?

• How do we understand why a spoken dialogue system fails or succeeds?

• System performance:

  – Prior evaluations used user satisfaction.
  – Is student learning more relevant for the tutoring domain?

• Interaction parameters:

  – Prior applications used generic parameters.
  – Are task-specific and affective parameters also useful?

4.3 Monitoring Student State (motivation)
• Tutor26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?

• Student27: dammit (ASR: it is)

• Tutor28: Could you please repeat that?

• Student29: same (ASR: i same)

• Tutor30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?

• Student31: zero (ASR: the zero)

• Tutor32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario.

• Student33: oh gravity you already said this (ASR: beats gravity you're exert less)

• Tutor34: Fine. Are there any other forces acting on the apple as it falls?

• Student35: no why are you doing this again (ASR: no y and to it yes)

• Tutor36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?

• Student37: downward you computer (ASR: downward you computer)

Challenge 1: What emotions to detect?
Negative: Angry, Annoyed, Frustrated, Bored, Confused, Uncertain, Contempt, Disgusted, Sad

Positive/Neutral: Certain, Curious, Enthusiastic, Eureka

Example student states in ITSPOKE (a classifier sketch follows this example):

• ITSPOKE: What else do you need to know to find the box's acceleration?

• Student: the direction [UNCERTAIN]

• ITSPOKE: If you see a body accelerate, what caused that acceleration?

• Student: force [CERTAIN]

• ITSPOKE: Good job. Say there is only one force acting on the box. How are this force, the box's mass, and its acceleration related?

• Student: velocity [UNCERTAIN]

• ITSPOKE: Could you please repeat that?

• Student: velocity [ANNOYED]
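A minimal sketch of the detection side, assuming hand-labeled turns and a few invented turn-level features; real systems such as ITSPOKE combine far richer acoustic-prosodic and lexical cues, but the train-then-predict shape is the same.

```python
# Toy uncertainty detector over per-turn features. The features and
# labels below are fabricated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Per turn: [mean pitch (normalized), turn duration (s), ASR confidence]
X = np.array([
    [0.8, 1.2, 0.90],   # confident, short, well-recognized
    [0.3, 3.5, 0.50],   # hesitant, long, poorly recognized
    [0.7, 1.0, 0.95],
    [0.2, 4.0, 0.40],
])
y = np.array([0, 1, 0, 1])  # 0 = certain, 1 = uncertain

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.25, 3.0, 0.45]]))  # likely flagged as uncertain
```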

Challenge 2: How to respond?

• In tutoring, not all negatively-valenced emotions are bad! While frustration, anger, and annoyance are often disruptive, frustration can also be an opportunity to learn.

• Example affect transition sequence from AutoTutor: neutral → flow → confusion → frustration → neutral

  – ITSPOKE: What is the net force acting on the truck equal to?
  – Student: I don't know [Uncertain, Frustrated]
  – ITSPOKE: Since net force is defined as the vector sum of all the forces exerted on an object, let's first find the forces acting on the truck. Try to name them.
  – Student: The impact force, the normal force, and gravity [Certain, Neutral]
  – ITSPOKE: Yep.

Discourse Structure:
Opportunity:

• Dialogues with tutoring systems have more complex hierarchical discourse structures than many other types of dialogues.

Challenges:

• How can discourse structure be exploited in the context of spoken dialogue systems?

• The average ITSPOKE dialogue is 20 minutes.

• Student turns are hierarchically structured (a stack-based sketch of these transitions follows this list):

  – Level 1: 1350 (57.3%)
  – Level 2: 643 (27.3%)
  – Level 3: 248 (10.5%)
  – Levels 4-6: 113 (4.8%)
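The Push/Pop/Advance transitions referenced in the findings below can be pictured as operations on a discourse stack whose depth gives each turn's level. A minimal sketch, with the transition names taken from this report and everything else assumed:

```python
# Toy discourse-structure tracker: Push opens a subtopic (deeper
# level), Pop closes it, Advance stays at the current level. The
# stack depth is the "level" of each student turn.
def track_levels(transitions):
    stack = ["top"]          # the main line of the dialogue
    levels = []
    for t in transitions:
        if t == "Push":
            stack.append("subtopic")
        elif t == "Pop" and len(stack) > 1:
            stack.pop()
        # "Advance" keeps the current depth
        levels.append(len(stack))
    return levels

# Usage: a short hypothetical dialogue
print(track_levels(["Advance", "Push", "Advance", "Push", "Pop", "Pop"]))
# -> [1, 2, 2, 3, 2, 1]
```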

Findings:
• Statistically significant dependencies exist between students' state of certainty and the responses of an expert human tutor (a dependency-test sketch follows this list):

  – After uncertain, the tutor Bottoms Out and avoids expansions.
  – After certain, the tutor Restates.
  – After mixed, the tutor Hints.
  – After any emotion, the tutor increases Feedback.

• These dependencies suggest adaptive strategies for implementation in computer tutoring systems.

• Statistically significant dependencies exist between student state and speech recognition problems:

  – Frustrated/angry turns are rejected more often than expected.
  – Uncertain turns have more problems than expected (certain turns have fewer).
  – Incorrect turns have more problems than expected (correct turns have fewer).

• Learning opportunities (e.g., uncertain and incorrect student states) have more speech recognition problems. However, speech recognition problems have not been found to correlate negatively with learning.

• Student correctness is predictive of student learning, but only after particular discourse transitions, e.g., after Pops (PopUp, PopUpAdvance):

  – Incorrect turns negatively predict learning.
  – Correct turns positively predict learning.

• Student certainness is likewise predictive only after particular transitions.

• While single discourse transitions are not predictive of learning, patterns in the discourse structure are; e.g., Advance-Advance and Push-Push both positively correlate with learning.

• Statistically significant dependencies exist between discourse transitions and speech recognition; e.g., after both Pushes and Pops, there are more misrecognitions.
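Dependencies like these are typically established with a contingency-table test. The following is a minimal sketch using scipy's chi-square test on a fabricated (student state × tutor response) table, not the actual counts behind the findings above.

```python
# Toy test for dependency between student certainty and tutor
# response type. Counts are fabricated for illustration.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: student state (certain, uncertain)
# Cols: tutor response (Restate, BottomOut, Hint)
table = np.array([
    [40, 10, 15],   # after certain turns
    [12, 35, 20],   # after uncertain turns
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.4f}")  # small p: responses depend on state
```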

4.4 Cobot: A Software Agent
• CobotDS provides real-time, two-way, natural language communication between a phone user and the multiple users in the text environment.

• We describe a number of the challenging design issues we faced, and our use of summarization, social filtering, and personalized grammars in tackling them. We report a number of empirical findings from a small user study.

• Cobot is one of the most popular LambdaMOO residents; it both chats with human users and provides them with social statistics summarizing their usage of verbs and interactions with other users (such as whom they interact with, who the most popular users are, and so on).

• CobotDS provides LambdaMOO users with spoken telephony access to Cobot, and is an experiment in providing a rich social connection between a telephone user and the text-based LambdaMOO users.

• To support conversation, CobotDS passes messages and verbs from the phone user to LambdaMOO users (via automatic speech recognition, or ASR), and from LambdaMOO to the phone user (via text-to-speech, or TTS).

• CobotDS also provides listening (allowing phone users to hear a description of all LambdaMOO activity), chat summarization and filtering, personalized grammars, and many other features.

Text-to-speech conversion

• U1 waves to U2.

• U2 bows gracefully to U1.

• U2 is overwhelmed by all these paper deadlines. U2 begins to slowly tear his hair out, one strand at a time.

• U1 comforts U2.

• U1 [to U2]: Remember, the mighty oak was once a nut like you.

• U2 [to U1]: Right, but his personal growth was assured. Thanks anyway,
though.

• U1 feels better now.

Calling Cobot
• Provided a dozen or so friendly LambdaMOO users with access to a toll-free CobotDS number.

• Users call in with their LambdaMOO user name and numeric password, then enter the main CobotDS command loop.

• Cobot announces the phone-call user in LambdaMOO.

• From LambdaMOO to the phone user:

  – MOO users direct arbitrary utterances or verbs to Cobot, prefixed by the text "phone".
  – Via TTS, Cobot passes the verb or utterance directly to the phone user.
  – There is virtually no noise on this channel.

• From the phone user to LambdaMOO: Cobot passes on utterances and verbs from the phone user (with attribution), mixed in with Cobot's other behavior and activities.

• But this channel is very noisy (due to ASR).

Personalization of Grammars
• The phone user could change the grammar in use by issuing the grammar command and then engaging in a subdialogue (a selection sketch follows this list).

• Two built-in grammars:

  – smalltalk: 228 hand-constructed phrases providing basic conversation, e.g., "yes", "no", "fantastic", "terrible", "I'm at home", etc.
  – cliché: 2950 common English sayings, e.g., "taking coal to Pittsburgh", etc.

• One personal grammar, comprised of a list of phrases provided by each phone user.
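A minimal sketch of how such grammar switching might look in code; the phrase lists and the recognize() call are hypothetical stand-ins for illustration, not CobotDS's actual implementation.

```python
# Toy grammar registry for a CobotDS-like phone interface.
# recognize() is a hypothetical stub for a grammar-constrained ASR call.
GRAMMARS = {
    "smalltalk": ["yes", "no", "fantastic", "terrible", "i'm at home"],
    "cliche": ["taking coal to pittsburgh", "the early bird gets the worm"],
}

def recognize(audio: bytes, phrases: list[str]) -> str:
    """ASR stub: pretend the first phrase was the best match."""
    return phrases[0]

class PhoneSession:
    def __init__(self, personal_phrases: list[str]):
        # built-in grammars plus this caller's personal grammar
        self.grammars = dict(GRAMMARS, personal=personal_phrases)
        self.active = "smalltalk"

    def set_grammar(self, name: str) -> None:
        if name in self.grammars:        # the "grammar" subdialogue
            self.active = name

    def say(self, audio: bytes) -> str:
        return recognize(audio, self.grammars[self.active])

s = PhoneSession(["see you in lambdamoo"])
s.set_grammar("personal")
print(s.say(b"..."))  # -> "see you in lambdamoo"
```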

Basic Phone Commands

• 38 standard LambdaMOO verbs, directed or not.

• Say command with multiple ASR grammars:

  – Smalltalk grammar: 228 useful phrases, exclamations, social pleasantries, assertions of mood, and statements of whereabouts.
  – Cliché grammar: 2950 English witticisms and sayings.
  – User-specific personal grammars, periodically updated/modified; the user can control the grammar via the grammar command.

• Listen command:

  – At every dialogue turn, CobotDS will attempt to describe all activity.
  – Provides the phone user a richer view and allows passive participation; the phone user has no scrollback.
  – The pace of activity can quickly outrun the TTS rate; thus CobotDS filters activity, including via social rules.

Other Useful Phone Commands

• Where and who commands

• Summarize command (a sketch follows this list):

  – Intended for use in non-listening mode
  – Provides a summary of the last n minutes of activity
  – Describes which users have generated the most activity
  – Characterizes interactions via verb usage
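A minimal sketch of what the summarize command might compute over a log of recent events; the (timestamp, user, verb) event format and the function itself are assumptions for illustration, not CobotDS's real code.

```python
# Toy "summarize" over the last n minutes of MOO activity.
# Events are (timestamp_seconds, user, verb) tuples: an assumed format.
import time
from collections import Counter

def summarize(events, minutes=5, now=None):
    now = now or time.time()
    recent = [(u, v) for ts, u, v in events if now - ts <= minutes * 60]
    by_user = Counter(u for u, _ in recent)   # who generated most activity
    by_verb = Counter(v for _, v in recent)   # characterize via verb usage
    top_users = ", ".join(u for u, _ in by_user.most_common(3))
    top_verbs = ", ".join(v for v, _ in by_verb.most_common(3))
    return f"Most active: {top_users}. Common verbs: {top_verbs}."

now = time.time()
log = [(now - 30, "U1", "wave"), (now - 20, "U2", "bow"),
       (now - 10, "U1", "comfort"), (now - 900, "U3", "grin")]
print(summarize(log, minutes=5, now=now))
# -> "Most active: U1, U2. Common verbs: wave, bow, comfort."
```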

4.5 An Intelligent Natural Language Conversational System for Academic Advising (INSTAVIS)
• Academic advisors assist students in academic, professional, social and personal matters. Successful advising increases student retention, improves graduation rates and helps students meet educational goals.

• This is an advising system that assists advisors in multiple tasks using natural language.

• This system features a conversational agent as the user interface, an academic advising knowledge base with a method to allow the users to contribute to it, an expert system for academic planning, and a web design structure for the implementation platform.

• The system is operational for several hundred students from a university department.

• The system performed well, obtaining scores close to 80% in its evaluation.

Chapter 5

FUTURE WORK

Our results contribute to the increasing body of literature demonstrating the utility of adding fully-automated affect adaptation to existing spoken dialogue systems. In future work we will examine other performance measures besides learning, and will manually annotate true disengagement and uncertainty in order to group students by amount of disengagement. Second, our results contribute to the literature suggesting that gender effects should be considered when designing dialogue systems. However, further research is needed to determine more effective combinations of disengagement and uncertainty adaptations for both males and females, and to investigate whether gender differences might be related to other types of measurable user factors. As future work, we will also integrate statistical measurements from the log files' content and include indirect evaluations by the constituencies. The evaluation data will offer advisors a documented assessment of the areas of advising that most concern students. Additional work includes adding an expert system for academic enrollment planning, a mechanism to forward selected conversations to advisors, and allowing users to add lexical definitions.

Chapter 6

CONCLUSION

Spoken Dialogue Systems are of great interest to researchers in Intelligent Tutoring. To be able to engage in conversation, a spoken dialogue system has to attend to, recognize, and understand what the user is saying, interpret utterances in context, decide what to say next, as well as when and how to say it. To achieve this, a wide range of research areas and technologies must be involved, such as automatic speech recognition, natural language understanding, dialogue management, natural language generation, and speech synthesis.
One-on-one tutoring is a powerful technique for helping students learn. Natural language dialogue contributes in a powerful way to the efficacy of one-on-one tutoring using presently available NLP technology. Computer tutors can be built and can serve as a valuable aid to student learning. Intelligent Tutoring in turn provides many opportunities and challenges for researchers in Spoken Dialogue Systems:

• Performance Evaluation

• Affective Reasoning

• Discourse Analysis

• and many more!

Chapter 7

ACKNOWLEDGMENT

On the successful completion of our project on Analysis of Spoken Dialog Systems, we would like to express our sincere gratitude to everyone who helped us in the completion of our project. Our project would not have been possible without the proper and rigorous guidance of our mentors, Ms. Ritu Singhal and Ms. Vimla Parihar, who guided us throughout our project in every possible way with their invaluable advice, useful suggestions, and relevant ideas that facilitated the completion of our project. We feel honored and privileged to work under them. Their guidance helped us achieve our goal of successfully completing the task at hand and facing our challenges, so that we would not procrastinate in obtaining the desired result.

Roll No Names of Students

4129441005 Tanpreet Kaur


4129441075 Twinkle
4129441076 Tanya

Thank you.

(April, 2017)
IP College for Women

