Figure 1: Illustrative example of the Speaker Model introduced in this work. Speaker IDs close in embedding space tend to
respond in the same manner. These speaker embeddings are learned jointly with word embeddings and all other parameters of
the neural model via backpropagation. In this example, if Rob is a speaker clustered with people who often mention England
in the training data, then generating the token england at time t = 2 becomes much more likely than generating u.s. A
non-persona model would instead prefer generating in the u.s. if u.s. is better represented in the training data across all speakers.
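The conditioning mechanism in the caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names, dimensions, and random values are our assumptions (in the actual model both embedding tables are trained jointly by backpropagation, not sampled).

```python
import random

random.seed(0)
VOCAB, SPEAKERS, DIM = 100, 10, 8

# Illustrative stand-ins for the learned word and speaker embedding tables.
# In the paper these are ordinary model parameters updated by backprop;
# here they are random, purely to show the data flow.
word_emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
speaker_emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(SPEAKERS)]

def decoder_input(word_id, speaker_id):
    """Input to one LSTM decoding step: the current word embedding
    concatenated with the per-conversation speaker embedding, so every
    generated token is conditioned on who is speaking."""
    return word_emb[word_id] + speaker_emb[speaker_id]

def cosine(u, v):
    """Similarity between two speaker vectors; speakers close in embedding
    space receive similar decoder inputs and thus tend to respond in the
    same manner (e.g., both mentioning england)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

x = decoder_input(42, 3)
assert len(x) == 2 * DIM  # word part plus speaker part
```

Concatenation at the input is one simple way to inject the speaker vector; the key point is only that the speaker embedding, like the word embeddings, is a trainable parameter visible at every decoding step.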
7.0% and 8.4% lower than the corresponding standard SEQ2SEQ models. In terms of BLEU score, we observe a performance boost similar to that on the Twitter dataset: the Speaker model and the Speaker-Addressee model outperform the standard SEQ2SEQ model by 13.7% and 10.6%. Comparing the Speaker-Addressee model against the Speaker model on the TV Series dataset, we do not observe a significant difference. We suspect that this is primarily due to the relatively small size of the dataset, in which the interactive patterns might not be fully captured. Smaller values of perplexity are observed for the Television Series dataset than for the Twitter dataset, whose perplexity is over 40, presumably due to the noisier nature of Twitter dialogues.

6.4 Qualitative Analysis

Diverse Responses by Different Speakers Table 7 shows responses generated by persona models to three different input questions. We randomly selected 10 speakers (without cherry-picking) from the original Twitter dataset, collected their user-level representations from a speaker look-up table, and integrated them into the decoding models. The model tends to generate specific responses for different people in response to the factual questions.[6]

message: Where is your hometown?
baseline: I was born in Canada.
user1: I'm from England.
user2: I'm from Manchester.
user3: I'm from Liverpool.
user4: England. You?
user5: My hometown.
user6: I'm from Texas.
user7: I'm from LA.
user8: England.
user9: I was born here.
user10: I was born in the US.

message: What company do you work for?
baseline: I work for a company.
user1: I don't have a job.
user2: I'm a manager.
user3: I'm working. #
user4: I work for a company.
user5: I'm working. #
user6: Customer service.
user7: I work in retail.
user8: I work in a cafe.
user9: I work at a gas station.
user10: I don't know #

message: What did you have for dinner?
baseline: I had fish and chips.
user1: I had spag bol.
user2: Chicken and chips.
user3: Chicken and rice.
user4: Fish and chips.
user5: I had spag bol.
user6: I had Mexican food.
user7: Salad...
user8: I had chicken and chips.
user9: I had spag bol.
user10: Pizza.

Table 7: Responses generated by the baseline (LSTM-MMI) and the Speaker Model for ten randomly selected users, without cherry-picking. # indicates poor-quality responses produced by the system.

Table 8 shows responses generated by the Speaker-Addressee Model on the TV-series dataset. Interestingly, we regularly observe that this model is sensitive to the identity of the addressee, generating words specifically targeted at that addressee (e.g., her name). For example, the model produces Of course, I love you, Emily in response to an input from Emily. The model also generates Of course I love you. (kisses him), where the pronoun him accurately identifies the gender of the addressee.

message: Do you love me?
baseline: Of course, I love you.
addressee: Emily, speaker: Sheldon
response: Of course, I love you, Emily.
addressee: Emily, speaker: Penny
response: Of course, I love you, Emily.
addressee: Leonard, speaker: Penny
response: Of course, I love you, Leonard.
addressee: Monica, speaker: Chandler
response: Of course I love you. I want to marry you.
addressee: Rachel, speaker: Chandler
response: Of course I love you.
addressee: Ross, speaker: Chandler
response: Of course I love you.
addressee: Ross, speaker: Rachel
response: Of course I love you. (kisses him)
addressee: Emily, speaker: Rachel
response: Of course I love you.

Table 8: Responses to Do you love me? from the Speaker-Addressee model on the TV-series dataset using different addressees and speakers.

Human Evaluation We conducted a human evaluation of outputs from the Speaker Model using a crowdsourcing service. Since we cannot expect crowdsourced human judges to know or attempt to learn the ground truth of Twitter users who are not well-known public figures, we designed our experiment to evaluate the consistency of outputs associated with the speaker IDs. To this end, we collected 24 pairs of questions for which we would expect responses to be consistent if the persona model is coherent. For example, responses to the questions What country do you live in? and What city do you live in? would be considered consistent if the answers were England and London respectively, but not if they were UK and Chicago. Similarly, the responses to Are you vegan or vegetarian? and Do you eat beef? are consistent if the answers generated are vegan and absolutely not, but not if they are vegan and I love beef. We collected 20 pairs of outputs for randomly selected personas provided by the Speaker Model for each question pair (480 response pairs total). We also obtained the corresponding outputs from the baseline MMI-enhanced SEQ2SEQ system.

Since our purpose is to measure the gain in consistency over the baseline system, we presented the pairs of answers system-pairwise, i.e., 4 responses, 2 from each system, displayed on the screen, and asked judges to decide which of the two systems was more consistent. The position in which the system pairs were presented on the screen was randomized. The two systems were judged on a 5-point zero-sum scale, assigning a score of 2 (-2) if one system was judged more (less) consistent than the other, and 1 (-1) if one was rated somewhat more (less) consistent. Ties were assigned a score of zero. Five judges rated each pair, and their scores were averaged and remapped into 5 equal-width bins. After discarding ties, we found the persona model was judged either more consistent or somewhat more consistent in 56.7% of cases. If we ignore the somewhat more consistent judgments, the persona model wins in 6.1% of cases, compared with only 1.6% for the baseline model. It should be emphasized that the baseline model is a strong baseline, since it represents the consensus of all 70K Twitter users in the dataset.[7]

Table 9 illustrates how consistency is an emergent property of two arbitrarily selected users. The model is capable of discovering the relations between different categories of location, such as London and the UK, or Jakarta and Indonesia. However, the model also makes inconsistent response decisions, generating different answers in the second example in response to questions asking about age or major. Our proposed persona models integrate user embeddings into the LSTM and thus can be viewed as encapsulating a trade-off between a persona-specific generation model and a general conversational model.

User1
message: Where is your hometown?  response: I'm from England.
message: Where are you from?  response: I'm from England.
message: In which city do you live now?  response: I live in London.
message: In which country do you live now?  response: I live in England.
message: What is your major?  response: Business. You?
message: What did you study in college?  response: I did business studies.
message: How old are you?  response: I'm 18.
message: What is your age?  response: I'm 18.

User2
message: Where is your hometown?  response: I live in Indonesia.
message: Where are you from?  response: England, you?
message: In which city do you live now?  response: I live in Jakarta.
message: In which country do you live now?  response: I live in Indonesia.
message: What is your major?  response: Business, you?
message: What did you study in college?  response: Psychology, you?
message: How old are you?  response: I'm 18.
message: What is your age?  response: I'm 16.

Table 9: Examples of speaker consistency and inconsistency generated by the Speaker Model.

[6] There appears to be a population bias in the training set that favors British users.
[7] I'm not pregnant is an excellent consensus answer to the question Are you pregnant?, while I'm pregnant is consistent as a response only for someone who also answers the question Are you a guy or a girl? with something in the vein of I'm a girl.
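The zero-sum judging scheme described above can be sketched as follows. The function names are ours, and the exact bin edges are an assumption (the paper does not give them); we take equal-width cuts of the [-2, 2] score range.

```python
def pair_score(judgments):
    """Average the five judges' zero-sum ratings for one response pair.
    Each rating is in {-2, -1, 0, 1, 2}: +2 means the persona model was
    judged more consistent, +1 somewhat more, 0 a tie; negative values
    favor the baseline symmetrically."""
    return sum(judgments) / len(judgments)

def to_bin(avg):
    """Remap an average in [-2, 2] onto 5 equal-width bins, labelled
    -2 (baseline clearly more consistent) through +2 (persona model
    clearly more consistent). Assumed edges: -1.2, -0.4, 0.4, 1.2."""
    edges = [-1.2, -0.4, 0.4, 1.2]
    label = -2
    for e in edges:
        if avg >= e:
            label += 1
    return label

# Four of five judges favor the persona model strongly: top bin.
assert to_bin(pair_score([2, 1, 2, 2, 2])) == 2
# All ties: the pair is discarded before computing win rates.
assert to_bin(pair_score([0, 0, 0, 0, 0])) == 0
```

After binning, pairs in the 0 bin (ties) are discarded, and the reported percentages count how often each system lands in the "more consistent" or "somewhat more consistent" bins on its side of the scale.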
7 Conclusions

We have presented two persona-based response generation models for open-domain conversation generation. There are many other dimensions of speaker behavior, such as mood and emotion, that are beyond the scope of the current paper and must be left to future work.

Although the gains presented by our new models are not spectacular, the systems outperform our baseline SEQ2SEQ systems in terms of BLEU, perplexity, and human judgments of speaker consistency. We have demonstrated that by encoding personas in distributed representations, we are able to capture personal characteristics such as speaking style and background information. In the Speaker-Addressee model, moreover, the evidence suggests that there is benefit in capturing dyadic interactions.

Our ultimate goal is to be able to take the profile of an arbitrary individual whose identity is not known in advance and generate conversations that accurately emulate that individual's persona in terms of linguistic response behavior and other salient characteristics. Such a capability would dramatically change the ways in which we interact with dialog agents of all kinds, opening up rich new possibilities for user interfaces. Given a sufficiently large training corpus in which a sufficiently rich variety of speakers is represented, this objective does not seem too far-fetched.

Acknowledgments

We wish to thank Stephanie Lukin, Pushmeet Kohli, Chris Quirk, Alan Ritter, and Dan Jurafsky for helpful discussions.

References

David Ameixa, Luisa Coheur, Pedro Fialho, and Paulo Quaresma. 2014. Luke, I am your father: Dealing with out-of-domain requests by using movies subtitles. In Intelligent Virtual Agents, pages 13-21. Springer.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of the International Conference on Learning Representations (ICLR).

Rafael E. Banchs and Haizhou Li. 2012. IRIS: A chat-oriented dialogue system based on the vector space model. In Proc. of the ACL 2012 System Demonstrations, pages 37-42.

Yun-Nung Chen, Wei Yu Wang, and Alexander Rudnicky. 2013. An empirical investigation of sparse log-linear models for improved dialogue act classification. In Proc. of ICASSP, pages 8317-8321. IEEE.

Werner Deutsch and Thomas Pechmann. 1982. Social interaction and the development of definite descriptions. Cognition, 11:159-184.

Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proc. of ACL-IJCNLP, pages 445-450, Beijing, China, July.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2014. Learning continuous phrase representations for translation modeling. In Proc. of ACL, pages 699-709, Baltimore, Maryland.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735-1780.

Alfred Kobsa. 1990. User modeling in dialog systems: Potentials and hazards. AI & Society, 4(3):214-231.

Esther Levin, Roberto Pieraccini, and Wieland Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11-23.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proc. of NAACL-HLT.

Grace I. Lin and Marilyn A. Walker. 2011. All the world's a stage: Learning character models from film. In Proc. of the Seventh AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE).

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proc. of ACL, pages 11-19, Beijing, China, July.

Lasguido Nio, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, and Satoshi Nakamura. 2014. Developing non-goal dialog system based on examples of drama television. In Natural Interaction with Robots, Knowbots and Smartphones, pages 355-361. Springer.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proc. of ACL, pages 160-167, Sapporo, Japan, July.

Alice H. Oh and Alexander I. Rudnicky. 2000. Stochastic language generation for spoken dialogue systems. In Proc. of the 2000 ANLP/NAACL Workshop on Conversational Systems, pages 27-32.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proc. of ACL, pages 311-318.

Roberto Pieraccini, David Suendermann, Krishna Dayanidhi, and Jackson Liscombe. 2009. Are we there yet? Research in commercial spoken dialog systems. In Text, Speech and Dialogue, pages 3-13. Springer.

Adwait Ratnaparkhi. 2002. Trainable approaches to surface natural language generation and their application to conversational dialog systems. Computer Speech & Language, 16(3):435-455.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proc. of EMNLP, pages 583-593.

Jost Schatzmann, Matthew N. Stuttle, Karl Weilhammer, and Steve Young. 2005. Effects of the user model on simulation-based learning of dialogue strategies. In Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 220-225.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proc. of AAAI.

Marilyn A. Walker, Rashmi Prasad, and Amanda Stent. 2003. A trainable generator for recommendations in multimodal dialog. In INTERSPEECH.

Marilyn A. Walker, Ricky Grant, Jennifer Sawyer, Grace I. Lin, Noah Wardrip-Fruin, and Michael Buell. 2011. Perceived or not perceived: Film character models for expressive NLG. In Interactive Storytelling, pages 109-121. Springer.

Marilyn A. Walker, Grace I. Lin, and Jennifer Sawyer. 2012. An annotated corpus of film dialogue for learning and characterizing character style. In LREC, pages 1373-1378.

William Yang Wang, Ron Artstein, Anton Leuski, and David Traum. 2011. Improving spoken dialogue understanding using phonetic mixture models. In FLAIRS Conference.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proc. of EMNLP, pages 1711-1721, Lisbon, Portugal, September.

Kaisheng Yao, Geoffrey Zweig, and Baolin Peng. 2015. Attention with intention for a neural network conversation model. CoRR, abs/1510.08565.

Steve Young, Milica Gasic, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150-174.