
A Persona-Based Neural Conversation Model

Jiwei Li [1]*   Michel Galley [2]   Chris Brockett [2]
Georgios P. Spithourakis [3]*   Jianfeng Gao [2]   Bill Dolan [2]

[1] Stanford University, Dept. of Computer Science
    jiweil@stanford.edu
[2] Microsoft Research, Redmond
    {mgalley,chrisbkt,jfgao,billdol}@microsoft.com
[3] University College London, Dept. of Computer Science
    g.spithourakis@cs.ucl.ac.uk

arXiv:1603.06155v2 [cs.CL] 8 Jun 2016

Abstract

We present persona-based models for handling the issue of speaker consistency in neural response generation. A speaker model encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style. A dyadic speaker-addressee model captures properties of interactions between two interlocutors. Our models yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models, with similar gains in speaker consistency as measured by human judges.

message   Where do you live now?
response  I live in Los Angeles.
message   In which city do you live now?
response  I live in Madrid.
message   In which country do you live now?
response  England, you?
message   Where were you born?
response  I was born in Canada.
message   Where are you from?
response  England, you?
message   Where did you grow up?
response  I grew up in Texas.
message   How old are you?
response  16 and you?
message   What's your age?
response  18.
message   What is your major?
response  I'm majoring in psychology
message   What did you study in college?
response  English lit.

Table 1: Inconsistent responses generated by a 4-layer SEQ2SEQ model trained on 25 million Twitter conversation snippets.

1 Introduction

As conversational agents gain traction as user interfaces, there has been growing research interest in training naturalistic conversation systems from large volumes of human-to-human interactions (Ritter et al., 2011; Sordoni et al., 2015; Vinyals and Le, 2015; Li et al., 2016). One major issue for these data-driven systems is their propensity to select the response with greatest likelihood, in effect a consensus response of the humans represented in the training data. Outputs are frequently vague or non-committal (Li et al., 2016), and when not, they can be wildly inconsistent, as illustrated in Table 1.

In this paper, we address the challenge of consistency and how to endow data-driven systems with the coherent persona needed to model human-like behavior, whether as personal assistants, personalized avatar-like agents, or game characters.[1] For present purposes, we will define PERSONA as the character that an artificial agent, as actor, plays or performs during conversational interactions. A persona can be viewed as a composite of elements of identity (background facts or user profile), language behavior, and interaction style. A persona is also adaptive, since an agent may need to present different facets to different human interlocutors depending on the interaction.

Fortunately, neural models of conversation generation (Sordoni et al., 2015; Shang et al., 2015; Vinyals and Le, 2015; Li et al., 2016) provide a straightforward mechanism for incorporating personas as embeddings. We therefore explore two persona models, a single-speaker Speaker Model and a dyadic Speaker-Addressee Model, within a sequence-to-sequence (SEQ2SEQ) framework (Sutskever et al., 2014). The Speaker Model integrates a speaker-level vector representation into the target part of the SEQ2SEQ model. Analogously, the Speaker-Addressee model encodes the interaction patterns of two interlocutors by constructing an interaction representation from their individual embeddings and incorporating it into the SEQ2SEQ model. These persona vectors are trained on human-human conversation data and used at test time to generate personalized responses. Our experiments on an open-domain corpus of Twitter conversations and dialog datasets comprising TV series scripts show that leveraging persona vectors can improve relative performance up to 20% in BLEU score and 12% in perplexity, with a commensurate gain in consistency as judged by human annotators.

* The entirety of this work was conducted at Microsoft.
[1] Vinyals and Le (2015) suggest that the lack of a coherent personality makes it impossible for current systems to pass the Turing test.

2 Related Work

This work follows the line of investigation initiated by Ritter et al. (2011), who treat generation of conversational dialog as a statistical machine translation (SMT) problem. Ritter et al. (2011) represents a break with previous and contemporaneous dialog work that relies extensively on hand-coded rules, typically either building statistical models on top of heuristic rules or templates (Levin et al., 2000; Young et al., 2010; Walker et al., 2003; Pieraccini et al., 2009; Wang et al., 2011) or learning generation rules from a minimal set of authored rules or labels (Oh and Rudnicky, 2000; Ratnaparkhi, 2002; Banchs and Li, 2012; Ameixa et al., 2014; Nio et al., 2014; Chen et al., 2013). More recently, Wen et al. (2015) have used a Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to learn from unaligned data in order to reduce the heuristic space of sentence planning and surface realization.

The SMT model proposed by Ritter et al., on the other hand, is end-to-end, purely data-driven, and contains no explicit model of dialog structure; the model learns to converse from human-to-human conversational corpora. Progress in SMT stemming from the use of neural language models (Sutskever et al., 2014; Gao et al., 2014; Bahdanau et al., 2015; Luong et al., 2015) has inspired efforts to extend these neural techniques to SMT-based conversational response generation. Sordoni et al. (2015) augment Ritter et al. (2011) by rescoring outputs using a SEQ2SEQ model conditioned on conversation history. Other researchers have recently used SEQ2SEQ to directly generate responses in an end-to-end fashion without relying on SMT phrase tables (Serban et al., 2015; Shang et al., 2015; Vinyals and Le, 2015). Serban et al. (2015) propose a hierarchical neural model aimed at capturing dependencies over an extended conversation history. Recent work by Li et al. (2016) measures mutual information between message and response in order to reduce the proportion of generic responses typical of SEQ2SEQ systems. Yao et al. (2015) employ an intention network to maintain the relevance of responses.

Modeling of users and speakers has been extensively studied within the standard dialog modeling framework (e.g., Wahlster and Kobsa, 1989; Kobsa, 1990; Schatzmann et al., 2005; Lin and Walker, 2011). Since generating meaningful responses in an open-domain scenario is intrinsically difficult in conventional dialog systems, existing models often focus on generalizing character style on the basis of qualitative statistical analysis (Walker et al., 2012; Walker et al., 2011). The present work, by contrast, is in the vein of the SEQ2SEQ models of Vinyals and Le (2015) and Li et al. (2016), enriching these models by training persona vectors directly from conversational data and relevant side-information, and incorporating these directly into the decoder.

3 Sequence-to-Sequence Models

Given a sequence of inputs X = {x_1, x_2, ..., x_{n_X}}, an LSTM associates each time step with an input gate, a memory gate and an output gate, respectively denoted as i_t, f_t and o_t. We distinguish e and h, where e_t denotes the vector for an individual text unit (for example, a word or sentence) at time step t, while h_t denotes the vector computed by the LSTM model at time t by combining e_t and h_{t-1}. c_t is the cell state vector at time t, and σ denotes the sigmoid function. The vector representation h_t for each time step t is then given by:

    i_t = σ(W_i · [h_{t-1}; e_t^s])
    f_t = σ(W_f · [h_{t-1}; e_t^s])
    o_t = σ(W_o · [h_{t-1}; e_t^s])
    l_t = tanh(W_l · [h_{t-1}; e_t^s])        (1)

    c_t = f_t ∘ c_{t-1} + i_t ∘ l_t           (2)

    h_t^s = o_t ∘ tanh(c_t)                   (3)
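Equations (1)-(3) can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation; stacking the four gate matrices into a single W, and all sizes used below, are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(W, e_t, h_prev, c_prev, K):
    """One LSTM step in the form of Eqs. (1)-(3).

    W stacks the four gate matrices W_i, W_f, W_o, W_l (each K x 2K)
    and is applied to the concatenation [h_{t-1}; e_t].
    """
    z = W @ np.concatenate([h_prev, e_t])   # four pre-activations, shape (4K,)
    i_t = sigmoid(z[0:K])                   # input gate
    f_t = sigmoid(z[K:2 * K])               # forget (memory) gate
    o_t = sigmoid(z[2 * K:3 * K])           # output gate
    l_t = np.tanh(z[3 * K:4 * K])           # candidate cell input
    c_t = f_t * c_prev + i_t * l_t          # Eq. (2): elementwise products
    h_t = o_t * np.tanh(c_t)                # Eq. (3)
    return h_t, c_t
```

Unrolling `lstm_step` over e_1, ..., e_{n_X} while threading (h, c) through produces the encoder states; the gating nonlinearities keep every component of h_t strictly inside (-1, 1).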
In the equations above, W_i, W_f, W_o, W_l ∈ R^{K×2K}. In SEQ2SEQ generation tasks, each input X is paired with a sequence of outputs to predict: Y = {y_1, y_2, ..., y_{n_Y}}. The LSTM defines a distribution over outputs and sequentially predicts tokens using a softmax function:

    p(Y|X) = ∏_{t=1}^{n_Y} p(y_t | x_1, x_2, ..., x_{n_X}, y_1, y_2, ..., y_{t-1})
           = ∏_{t=1}^{n_Y} exp(f(h_{t-1}, e_{y_t})) / Σ_{y'} exp(f(h_{t-1}, e_{y'}))

where f(h_{t-1}, e_{y_t}) denotes the activation function between h_{t-1} and e_{y_t}. Each sentence terminates with a special end-of-sentence symbol EOS. In keeping with common practice, inputs and outputs use different LSTMs with separate parameters to capture different compositional patterns.

During decoding, the algorithm terminates when an EOS token is predicted. At each time step, either a greedy approach or beam search can be adopted for word prediction.

4 Personalized Response Generation

Our work introduces two persona-based models: the Speaker Model, which models the personality of the respondent, and the Speaker-Addressee Model, which models the way the respondent adapts their speech to a given addressee, a linguistic phenomenon known as lexical entrainment (Deutsch and Pechmann, 1982).

4.1 Notation

For the response generation task, let M denote the input word sequence (message), M = {m_1, m_2, ..., m_I}. R denotes the word sequence in response to M, where R = {r_1, r_2, ..., r_J, EOS} and J is the length of the response (terminated by an EOS token). r_t denotes a word token that is associated with a K-dimensional distinct word embedding e_t. V is the vocabulary size.

4.2 Speaker Model

Our first model is the Speaker Model, which models the respondent alone. This model represents each individual speaker as a vector or embedding, which encodes speaker-specific information (e.g., dialect, register, age, gender, personal information) that influences the content and style of her responses. Note that these attributes are not explicitly annotated, which would be tremendously expensive for our datasets. Instead, our model manages to cluster users along some of these traits (e.g., age, country of residence) based on the responses alone.

Figure 1 gives a brief illustration of the Speaker Model. Each speaker i ∈ [1, N] is associated with a user-level representation v_i ∈ R^{K×1}. As in standard SEQ2SEQ models, we first encode message S into a vector representation h_S using the source LSTM. Then for each step on the target side, hidden units are obtained by combining the representation produced by the target LSTM at the previous time step, the word representation at the current time step, and the speaker embedding v_i:

    i_t = σ(W_i · [h_{t-1}; e_t^s; v_i])
    f_t = σ(W_f · [h_{t-1}; e_t^s; v_i])
    o_t = σ(W_o · [h_{t-1}; e_t^s; v_i])
    l_t = tanh(W_l · [h_{t-1}; e_t^s; v_i])   (4)

    c_t = f_t ∘ c_{t-1} + i_t ∘ l_t           (5)

    h_t^s = o_t ∘ tanh(c_t)                   (6)

where W = [W_i; W_f; W_o; W_l] ∈ R^{4K×3K}. In this way, speaker information is encoded and injected into the hidden layer at each time step and thus helps predict personalized responses throughout the generation process. The speaker embedding v_i is shared across all conversations that involve speaker i. The embeddings {v_i} are learned by back-propagating word prediction errors to each neural component during training.

Another useful property of this model is that it helps infer answers to questions even if the evidence is not readily present in the training set. This is important as the training data does not contain explicit information about every attribute of each user (e.g., gender, age, country of residence). The model learns speaker representations based on conversational content produced by different speakers, and speakers producing similar responses tend to have similar embeddings, occupying nearby positions in the vector space. In this way, the training data of speakers nearby in vector space help increase the generalization capability of the speaker model. For example, consider two speakers i and j who sound distinctly British, and who are therefore close in speaker embedding space. Now, suppose that, in the training data, speaker i was asked Where do you live? and responded in the UK. Even if speaker j was never asked the same question, this answer can help influence a good response from speaker j, and this without explicitly labeled geo-location information.
[Figure 1 appears here. It depicts a source LSTM encoding the message "where do you live EOS" and a target LSTM generating "in england . EOS", with the speaker embedding for the user Rob injected at every decoding step. Speaker embeddings (70k users) and word embeddings (50k words) are shown in learned vector spaces in which similar speakers, and similar words, cluster together.]

Figure 1: Illustrative example of the Speaker Model introduced in this work. Speaker IDs close in embedding space tend to respond in the same manner. These speaker embeddings are learned jointly with word embeddings and all other parameters of the neural model via backpropagation. In this example, say Rob is a speaker clustered with people who often mention England in the training data; then the generation of the token england at time t = 2 would be much more likely than that of u.s. A non-persona model would prefer generating in the u.s. if u.s. is more represented in the training data across all speakers.
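The figure's preference for england over u.s. at t = 2 comes down to the per-step softmax of Section 3: whichever word embedding scores higher against the current hidden state receives more probability mass. A toy sketch, with the activation f taken to be a plain dot product (an assumed choice for illustration, as are the sizes):

```python
import numpy as np

def next_word_distribution(h_prev, word_embeddings):
    """Softmax over the vocabulary: p(y_t) proportional to exp(f(h_{t-1}, e_y))."""
    scores = word_embeddings @ h_prev      # f as a dot product: one score per word
    scores -= scores.max()                 # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(3)
V, K = 50, 8                               # toy vocabulary and hidden sizes
E = rng.standard_normal((V, K))            # rows are word embeddings e_y
p = next_word_distribution(rng.standard_normal(K), E)
```

A speaker embedding that pushes the hidden state toward the england row of E raises that word's probability at this step, which is the effect the figure illustrates.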

4.3 Speaker-Addressee Model

A natural extension of the Speaker Model is a model that is sensitive to speaker-addressee interaction patterns within the conversation. Indeed, speaking style, register, and content do not vary only with the identity of the speaker, but also with that of the addressee. For example, in scripts for the TV series Friends used in some of our experiments, the character Ross often talks differently to his sister Monica than to Rachel, with whom he is engaged in an on-again off-again relationship throughout the series.

The proposed Speaker-Addressee Model operates as follows: We wish to predict how speaker i would respond to a message produced by speaker j. Similarly to the Speaker Model, we associate each speaker with a K-dimensional speaker-level representation, namely v_i for user i and v_j for user j. We obtain an interactive representation V_{i,j} ∈ R^{K×1} by linearly combining user vectors v_i and v_j in an attempt to model the interactive style of user i towards user j,

    V_{i,j} = tanh(W_1 · v_i + W_2 · v_j)     (7)

where W_1, W_2 ∈ R^{K×K}. V_{i,j} is then linearly incorporated into LSTM models at each step in the target:

    i_t = σ(W_i · [h_{t-1}; e_t^s; V_{i,j}])
    f_t = σ(W_f · [h_{t-1}; e_t^s; V_{i,j}])
    o_t = σ(W_o · [h_{t-1}; e_t^s; V_{i,j}])
    l_t = tanh(W_l · [h_{t-1}; e_t^s; V_{i,j}])   (8)

    c_t = f_t ∘ c_{t-1} + i_t ∘ l_t           (9)

    h_t^s = o_t ∘ tanh(c_t)                   (10)

V_{i,j} depends on both speaker and addressee, and the same speaker will thus respond differently to a message from different interlocutors. One potential issue with Speaker-Addressee modeling is the difficulty involved in collecting a large-scale training dataset in which each speaker is involved in conversation with a wide variety of people. Like the Speaker Model, however, the Speaker-Addressee Model derives generalization capabilities from speaker embeddings. Even if the two speakers at test time (i and j) were never involved in the same conversation in the training data, two speakers i' and j' who are respectively close in embedding space may have been, and this can help model how i should respond to j.

4.4 Decoding and Reranking

For decoding, the N-best lists are generated using the decoder with beam size B = 200. We set a maximum length of 20 for the generated candidates. Decoding operates as follows: At each time step, we first examine all B × B possible next-word candidates, and add all hypotheses ending with an EOS token to the N-best list. We then preserve the top-B unfinished hypotheses and move to the next word position.

To deal with the issue that SEQ2SEQ models tend to generate generic and commonplace responses such as I don't know, we follow Li et al. (2016) by reranking the generated N-best list using
a scoring function that linearly combines a length penalty and the log likelihood of the source given the target:

    log p(R|M, v) + λ log p(M|R) + γ|R|       (11)

where p(R|M, v) denotes the probability of the generated response given the message M and the respondent's speaker ID, |R| denotes the length of the target, and γ denotes the associated penalty weight. We optimize λ and γ on N-best lists of response candidates generated from the development set using MERT (Och, 2003) by optimizing BLEU. To compute p(M|R), we train an inverse SEQ2SEQ model by swapping messages and responses. We trained standard SEQ2SEQ models for p(M|R) with no speaker information considered.

5 Datasets

5.1 Twitter Persona Dataset

Data Collection  Training data for the Speaker Model was extracted from the Twitter FireHose for the six-month period beginning January 1, 2012. We limited the sequences to those where the responders had engaged in at least 60 (and at most 300) 3-turn conversational interactions during the period, in other words, users who reasonably frequently engaged in conversation. This yielded a set of 74,003 users who took part in a minimum of 60 and a maximum of 164 conversational turns (average: 92.24, median: 90). The dataset extracted using responses by these conversationalists contained 24,725,711 3-turn sliding-window (context-message-response) conversational sequences.

In addition, we sampled 12,000 3-turn conversations from the same user set from the Twitter FireHose for the three-month period beginning July 1, 2012, and set these aside as development, validation, and test sets (4,000 conversational sequences each). Note that development, validation, and test sets for this data are single-reference, which is by design. Multiple reference responses would typically require acquiring responses from different people, which would confound different personas.

Training Protocols  We trained four-layer SEQ2SEQ models on the Twitter corpus following the approach of (Sutskever et al., 2014). Details are as follows:

- 4-layer LSTM models with 1,000 hidden cells for each layer.
- Batch size is set to 128.
- Learning rate is set to 1.0.
- Parameters are initialized by sampling from the uniform distribution [-0.1, 0.1].
- Gradients are clipped to avoid gradient explosion with a threshold of 5.
- Vocabulary size is limited to 50,000.
- Dropout rate is set to 0.2.

Source and target LSTMs use different sets of parameters. We ran 14 epochs, and training took roughly a month to finish on a Tesla K40 GPU machine.

As only speaker IDs of responses were specified when compiling the Twitter dataset, experiments on this dataset were limited to the Speaker Model.

5.2 Twitter Sordoni Dataset

The Twitter Persona Dataset was collected for this paper for experiments with speaker ID information. To obtain a point of comparison with prior state-of-the-art work (Sordoni et al., 2015; Li et al., 2016), we measure our baseline (non-persona) LSTM model against prior work on the dataset of (Sordoni et al., 2015), which we call the Twitter Sordoni Dataset. We only use its test-set portion, which contains responses for 2,114 contexts and messages. It is important to note that the Sordoni dataset offers up to 10 references per message, while the Twitter Persona dataset has only 1 reference per message. Thus BLEU scores cannot be compared across the two Twitter datasets (BLEU scores on 10 references are generally much higher than with 1 reference). Details of this dataset are in (Sordoni et al., 2015).

5.3 Television Series Transcripts

Data Collection  For the dyadic Speaker-Addressee Model we used scripts from the American television comedies Friends[2] and The Big Bang Theory,[3] available from the Internet Movie Script Database (IMSDb).[4] We collected 13 main characters from the two series in a corpus of 69,565 turns. We split the corpus into training/development/testing sets, with development and testing sets each of about 2,000 turns.

[2] https://en.wikipedia.org/wiki/Friends
[3] https://en.wikipedia.org/wiki/The_Big_Bang_Theory
[4] http://www.imsdb.com

Training  Since the relatively small size of the dataset does not allow for training an open-domain dialog model, we adopted a domain adaptation strategy where we first trained standard SEQ2SEQ
models using a much larger OpenSubtitles (OSDb) dataset (Tiedemann, 2009), and then adapted the pre-trained model to the TV series dataset.

The OSDb dataset is a large, noisy, open-domain dataset containing roughly 60M-70M scripted lines spoken by movie characters. This dataset does not specify which character speaks each subtitle line, which prevents us from inferring speaker turns. Following Vinyals et al. (2015), we make the simplifying assumption that each line of subtitle constitutes a full speaker turn.[5] We trained standard SEQ2SEQ models on the OSDb dataset, following the protocols already described in Section 5.1, and ran 10 iterations over the training set.

We initialize word embeddings and LSTM parameters in the Speaker Model and the Speaker-Addressee Model using parameters learned from the OpenSubtitles dataset. User embeddings are randomly initialized from [-0.1, 0.1]. We then ran 5 additional epochs until the perplexity on the development set stabilized.

[5] This introduces a degree of noise as consecutive lines are not necessarily from the same scene or two different speakers.

6 Experiments

6.1 Evaluation

Following (Sordoni et al., 2015; Li et al., 2016) we used BLEU (Papineni et al., 2002) for parameter tuning and evaluation. BLEU has been shown to correlate well with human judgment on the response generation task, as demonstrated in (Galley et al., 2015). Besides BLEU scores, we also report perplexity as an indicator of model capability.

6.2 Baseline

Since our main experiments are with a new dataset (the Twitter Persona Dataset), we first show that our LSTM baseline is competitive with the state-of-the-art (Li et al., 2016) on an established dataset, the Twitter Sordoni Dataset (Sordoni et al., 2015). Our baseline is simply our implementation of the LSTM-MMI of (Li et al., 2016), so results should be relatively close to their reported results. Table 2 summarizes our results against prior work. We see that our system actually does better than (Li et al., 2016), and we attribute the improvement to a larger training corpus, the use of dropout during training, and possibly to the conversationalist nature of our corpus.

System                                 BLEU
MT baseline (Ritter et al., 2011)      3.60%
Standard LSTM MMI (Li et al., 2016)    5.26%
Standard LSTM MMI (our system)         5.82%
Human                                  6.08%

Table 2: BLEU on the Twitter Sordoni dataset (10 references). We contrast our baseline against an SMT baseline (Ritter et al., 2011), and the best result (Li et al., 2016) on the established dataset of (Sordoni et al., 2015). The last result is for a human oracle, but it is not directly comparable as the oracle BLEU is computed in a leave-one-out fashion, having one less reference available. We nevertheless provide this result to give a sense that these BLEU scores of 5-6% are not unreasonable.

6.3 Results

We first report performance on the Twitter Persona dataset. Perplexity is reported in Table 3. We observe about a 10% decrease in perplexity for the Speaker model over the standard SEQ2SEQ model. In terms of BLEU scores (Table 4), a significant performance boost is observed for the Speaker model over the standard SEQ2SEQ model, yielding an increase of 21% in the maximum likelihood (MLE) setting and 11.7% for the mutual information setting (MMI). In line with findings in (Li et al., 2016), we observe a consistent performance boost introduced by the MMI objective function over a standard SEQ2SEQ model based on the MLE objective function. It is worth noting that our persona models are more beneficial to the MLE models than to the MMI models. This result is intuitive, as the persona models help make Standard LSTM MLE outputs more informative and less bland, and thus make the use of MMI less critical.

Model       Standard LSTM   Speaker Model
Perplexity  47.2            42.2 (-10.6%)

Table 3: Perplexity for standard SEQ2SEQ and the Speaker model on the Twitter Persona development set.

Model           Objective   BLEU
Standard LSTM   MLE         0.92%
Speaker Model   MLE         1.12% (+21.7%)
Standard LSTM   MMI         1.41%
Speaker Model   MMI         1.66% (+11.7%)

Table 4: BLEU on the Twitter Persona dataset (1 reference), for the standard SEQ2SEQ model and the Speaker model using as objective either maximum likelihood (MLE) or maximum mutual information (MMI).

For the TV Series dataset, perplexity and BLEU scores are respectively reported in Table 5 and Table 6. As can be seen, the Speaker and Speaker-Addressee models respectively achieve perplexity values of 25.4 and 25.0 on the TV-series dataset,
Model       Standard LSTM   Speaker Model   Speaker-Addressee Model
Perplexity  27.3            25.4 (-7.0%)    25.0 (-8.4%)

Table 5: Perplexity for standard SEQ2SEQ and persona models on the TV series dataset.

Model   Standard LSTM   Speaker Model    Speaker-Addressee Model
MLE     1.60%           1.82% (+13.7%)   1.83% (+14.3%)
MMI     1.70%           1.90% (+10.6%)   1.88% (+10.9%)

Table 6: BLEU on the TV series dataset (1 reference), for the standard SEQ2SEQ and persona models.

7.0% and 8.4% lower than the corresponding standard SEQ2SEQ models. In terms of BLEU score, we observe a similar performance boost as on the Twitter dataset: the Speaker model and the Speaker-Addressee model outperform the standard SEQ2SEQ model by 13.7% and 10.6%. Comparing the Speaker-Addressee model against the Speaker model on the TV Series dataset, we do not observe a significant difference. We suspect that this is primarily due to the relatively small size of the dataset, where the interactive patterns might not be fully captured. Smaller values of perplexity are observed for the Television Series dataset than for the Twitter dataset, whose perplexity is over 40, presumably due to the noisier nature of Twitter dialogues.

6.4 Qualitative Analysis

Diverse Responses by Different Speakers  Table 7 presents responses generated by persona models in response to three different input questions. We randomly selected 10 speakers (without cherry-picking) from the original Twitter dataset. We collected their user-level representations from a speaker look-up table and integrated them into the decoding models. The model tends to generate specific responses for different people in response to the factual questions.[6]

[6] There appears to be a population bias in the training set that favors British users.

message   Where is your hometown?
baseline  I was born in Canada.
user1     I'm from England.
user2     I'm from Manchester.
user3     I'm from Liverpool.
user4     England. You?
user5     My hometown.
user6     I'm from Texas.
user7     I'm from LA.
user8     England.
user9     I was born here.
user10    I was born in the us.

message   What company do you work for?
baseline  I work for a company.
user1     I don't have a job.
user2     I'm a manager.
user3     I'm working. #
user4     I work for a company.
user5     I'm working. #
user6     Customer service.
user7     I work in retail.
user8     I work in a cafe.
user9     I work at a gas station.
user10    I don't know #

message   What did you have for dinner?
baseline  I had fish and chips.
user1     I had spag bol.
user2     Chicken and chips.
user3     Chicken and rice.
user4     Fish and chips.
user5     I had spag bol.
user6     I had Mexican food.
user7     Salad...
user8     I had chicken and chips.
user9     I had spag bol.
user10    Pizza.

Table 7: Responses generated by the baseline (LSTM-MMI) and the Speaker Model for ten randomly selected users, without cherry picking. # indicates poor-quality responses produced by the system.

Table 8 shows responses generated from the Speaker-Addressee Model using the TV-series dataset. Interestingly, we regularly observe that this model is sensitive to the identity of the addressee, generating words specifically targeted at that addressee (e.g., her name). For example, the model produces Of course, I love you, Emily in response to an input from Emily. Also, the model generates Of course I love you. (kisses him), where the pronoun him accurately identifies the gender of the addressee.

Human Evaluation  We conducted a human evaluation of outputs from the Speaker Model, using a crowdsourcing service. Since we cannot expect crowdsourced human judges to know or attempt to learn the ground truth of Twitter users who are not well-known public figures, we designed our experiment to evaluate the consistency of outputs associated with the speaker IDs. To this end, we collected 24 pairs of questions for which we would expect
message Do you love me? User1
baseline Of course, I love you. message Where is your hometown?
addressee: Emily, speaker: Sheldon response Im from England.
response Of course, I love you, Emily. message Where are you from?
addressee: Emily, speaker: Penny response Im from England.
response Of course, I love you, Emily. message In which city do you live now?
addressee: Leonard, speaker: Penny response I live in London.
response Of course, I love you, Leonard. message In which country do you live now?
addressee: Monica, speaker: Chandler response I live in England.
response Of course I love you. I want to marry you. message What is your major?
addressee: Rachel, speaker: Chandler response Business. You?
response Of course I love you. message What did you study in college?
addressee: Ross, speaker: Chandler response I did business studies.
response Of course I love you. message How old are you?
addressee: Ross, speaker: Rachel response Im 18.
response Of course I love you. (kisses him) message What is your age?
addressee: Emily, speaker: Rachel response Im 18.
response Of course I love you. User2
message Where is your hometown?
Table 8: Responses to Do you love me? from the Speaker- response I live in Indonesia.
Addressee model on the TV-series dataset using different ad- message Where are you from?
dressees and speakers. response England, you?
message In which city do you live now?
response I live in Jakarta.
message In which country do you live now?
responses to be consistent if the persona model is response I live in Indonesia.
coherent. For example, responses to the questions message What is your major?
What country do you live in? and What city do you live in? would be considered consistent if the answers were England and London respectively, but not if they were UK and Chicago. Similarly, the responses to Are you vegan or vegetarian? and Do you eat beef? are consistent if the answers generated are vegan and absolutely not, but not if they are vegan and I love beef. We collected 20 pairs of outputs for randomly-selected personas provided by the Speaker Model for each question pair (480 response pairs total). We also obtained the corresponding outputs from the baseline MMI-enhanced S EQ 2S EQ system.

Since our purpose is to measure the gain in consistency over the baseline system, we presented the pairs of answers system-pairwise, i.e., 4 responses, 2 from each system, displayed on the screen, and asked judges to decide which of the two systems was more consistent. The position in which the system pairs were presented on the screen was randomized. The two systems were judged on a 5-point zero-sum scale, assigning a score of 2 (-2) if one system was judged more (less) consistent than the other, and 1 (-1) if one was rated somewhat more (less) consistent. Ties were assigned a score of zero. Five judges rated each pair and their scores were averaged and remapped into 5 equal-width bins. After discarding ties, we found the persona model was judged either more consistent or somewhat more consistent in 56.7% of cases. If we ignore the somewhat more consistent judgments, the persona model wins in 6.1% of cases, compared with only 1.6% for the baseline model. It should be emphasized that the baseline model is a strong baseline, since it represents the consensus of all 70K Twitter users in the dataset.7

7 I'm not pregnant is an excellent consensus answer to the question Are you pregnant?, while I'm pregnant is consistent as a response only in the case of someone who also answers the question Are you a guy or a girl? with something in the vein of I'm a girl.

response Business, you?
message What did you study in college?
response Psychology, you?
message How old are you?
response I'm 18.
message What is your age?
response I'm 16.

Table 9: Examples of speaker consistency and inconsistency generated by the Speaker Model

Table 9 illustrates how consistency is an emergent property of two arbitrarily selected users. The model is capable of discovering the relations between different categories of location, such as London and the UK, or Jakarta and Indonesia. However, the model also makes inconsistent response decisions, generating different answers in the second example in response to questions asking about age or major. Our proposed persona models integrate user embeddings into the LSTM, and thus can be viewed as encapsulating a trade-off between a persona-specific generation model and a general conversational model.
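As a rough illustration of this idea (not the authors' implementation; all dimensions, parameter names, and the concatenation scheme are illustrative), a persona embedding can be fed into an LSTM decoder by appending the speaker's vector to the word embedding at every time step, so that the same input word yields different decoder states for different speakers:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, WORD_DIM, PERSONA_DIM, HIDDEN = 50, 8, 4, 16
N_SPEAKERS = 10

# Illustrative parameters: word embeddings, one persona vector per speaker,
# and a single LSTM cell whose input is the word embedding concatenated
# with the speaker's persona embedding.
word_emb = rng.normal(size=(VOCAB, WORD_DIM))
persona_emb = rng.normal(size=(N_SPEAKERS, PERSONA_DIM))
IN_DIM = WORD_DIM + PERSONA_DIM
W = rng.normal(scale=0.1, size=(4 * HIDDEN, IN_DIM + HIDDEN))
b = np.zeros(4 * HIDDEN)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_step(word_id, speaker_id, h, c):
    """One decoder step: the persona vector rides along with every input word."""
    x = np.concatenate([word_emb[word_id], persona_emb[speaker_id]])
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)            # input, forget, output gates; candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# The same word produces different hidden states under different personas,
# which is what lets the decoder give speaker-specific answers.
h0, c0 = np.zeros(HIDDEN), np.zeros(HIDDEN)
h_a, _ = decode_step(3, speaker_id=1, h=h0, c=c0)
h_b, _ = decode_step(3, speaker_id=2, h=h0, c=c0)
print(np.allclose(h_a, h_b))  # False: the persona changes the decoder state
```

In training, the persona vectors would be learned jointly with the other parameters, so that speakers with similar response behavior end up with nearby embeddings.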
7 Conclusions

We have presented two persona-based response generation models for open-domain conversation generation. There are many other dimensions of speaker behavior, such as mood and emotion, that are beyond the scope of the current paper and must be left to future work.

Although the gains presented by our new models are not spectacular, the systems outperform our baseline S EQ 2S EQ systems in terms of B LEU, perplexity, and human judgments of speaker consistency. We have demonstrated that by encoding personas in distributed representations, we are able to capture personal characteristics such as speaking style and background information. In the Speaker-Addressee model, moreover, the evidence suggests that there is benefit in capturing dyadic interactions.

Our ultimate goal is to be able to take the profile of an arbitrary individual whose identity is not known in advance, and generate conversations that accurately emulate that individual's persona in terms of linguistic response behavior and other salient characteristics. Such a capability will dramatically change the ways in which we interact with dialog agents of all kinds, opening up rich new possibilities for user interfaces. Given a sufficiently large training corpus in which a sufficiently rich variety of speakers is represented, this objective does not seem too far-fetched.

Acknowledgments

We wish to thank Stephanie Lukin, Pushmeet Kohli, Chris Quirk, Alan Ritter, and Dan Jurafsky for helpful discussions.

References

David Ameixa, Luisa Coheur, Pedro Fialho, and Paulo Quaresma. 2014. Luke, I am your father: dealing with out-of-domain requests by using movies subtitles. In Intelligent Virtual Agents, pages 13–21. Springer.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of the International Conference on Learning Representations (ICLR).

Rafael E. Banchs and Haizhou Li. 2012. IRIS: a chat-oriented dialogue system based on the vector space model. In Proc. of the ACL 2012 System Demonstrations, pages 37–42.

Yun-Nung Chen, Wei Yu Wang, and Alexander Rudnicky. 2013. An empirical investigation of sparse log-linear models for improved dialogue act classification. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8317–8321. IEEE.

Werner Deutsch and Thomas Pechmann. 1982. Social interaction and the development of definite descriptions. Cognition, 11:159–184.

Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. ∆B LEU: A discriminative metric for generation tasks with intrinsically diverse targets. In Proc. of ACL-IJCNLP, pages 445–450, Beijing, China, July.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2014. Learning continuous phrase representations for translation modeling. In Proc. of ACL, pages 699–709, Baltimore, Maryland.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Alfred Kobsa. 1990. User modeling in dialog systems: Potentials and hazards. AI & Society, 4(3):214–231.

Esther Levin, Roberto Pieraccini, and Wieland Eckert. 2000. A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing, 8(1):11–23.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proc. of NAACL-HLT.

Grace I. Lin and Marilyn A. Walker. 2011. All the world's a stage: Learning character models from film. In Proceedings of the Seventh AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE).

Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proc. of ACL, pages 11–19, Beijing, China, July.

Lasguido Nio, Sakriani Sakti, Graham Neubig, Tomoki Toda, Mirna Adriani, and Satoshi Nakamura. 2014. Developing non-goal dialog system based on examples of drama television. In Natural Interaction with Robots, Knowbots and Smartphones, pages 355–361. Springer.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July.
Alice H. Oh and Alexander I. Rudnicky. 2000. Stochastic language generation for spoken dialogue systems. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational Systems, Volume 3, pages 27–32.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. B LEU: a method for automatic evaluation of machine translation. In Proc. of ACL, pages 311–318.

Roberto Pieraccini, David Suendermann, Krishna Dayanidhi, and Jackson Liscombe. 2009. Are we there yet? Research in commercial spoken dialog systems. In Text, Speech and Dialogue, pages 3–13. Springer.

Adwait Ratnaparkhi. 2002. Trainable approaches to surface natural language generation and their application to conversational dialog systems. Computer Speech & Language, 16(3):435–455.

Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593.

Jost Schatzmann, Matthew N. Stuttle, Karl Weilhammer, and Steve Young. 2005. Effects of the user model on simulation-based learning of dialogue strategies. In 2005 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 220–225.

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proc. of AAAI.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. In ACL-IJCNLP, pages 1577–1586.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Meg Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Proc. of NAACL-HLT.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

Jörg Tiedemann. 2009. News from OPUS: a collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume 5, pages 237–248.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. In Proc. of ICML Deep Learning Workshop.

Wolfgang Wahlster and Alfred Kobsa. 1989. User Models in Dialog Systems. Springer.

Marilyn A. Walker, Rashmi Prasad, and Amanda Stent. 2003. A trainable generator for recommendations in multimodal dialog. In INTERSPEECH.

Marilyn A. Walker, Ricky Grant, Jennifer Sawyer, Grace I. Lin, Noah Wardrip-Fruin, and Michael Buell. 2011. Perceived or not perceived: Film character models for expressive NLG. In Interactive Storytelling, pages 109–121. Springer.

Marilyn A. Walker, Grace I. Lin, and Jennifer Sawyer. 2012. An annotated corpus of film dialogue for learning and characterizing character style. In LREC, pages 1373–1378.

William Yang Wang, Ron Artstein, Anton Leuski, and David Traum. 2011. Improving spoken dialogue understanding using phonetic mixture models. In FLAIRS Conference.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proc. of EMNLP, pages 1711–1721, Lisbon, Portugal, September.

Kaisheng Yao, Geoffrey Zweig, and Baolin Peng. 2015. Attention with intention for a neural network conversation model. CoRR, abs/1510.08565.

Steve Young, Milica Gasic, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150–174.