
Generating Sentences by Editing Prototypes

Kelvin Guu*2 Tatsunori B. Hashimoto*1,2 Yonatan Oren1 Percy Liang1,2


(* equal contribution)
1 Department of Computer Science   2 Department of Statistics
Stanford University
{kguu,thashim,yonatano}@stanford.edu   pliang@cs.stanford.edu

Transactions of the Association for Computational Linguistics, vol. 6, pp. 437–450, 2018. Action Editor: Trevor Cohn.
Submission batch: 9/2017; Revision batch: 12/2017; Published 7/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.

[Figure 1: The prototype-then-edit model generates a sentence by sampling a random example from the training set and then editing it using a randomly sampled edit vector. For example, the prototype "The food here is ok but not worth the price ." is sampled from the training set and edited (using attention) into generations such as "The food is mediocre and not worth the ridiculous price ." and "The food is not worth the price ."]
1 Introduction
The ability to generate sentences is core to many NLP tasks, including machine translation, summarization, speech recognition, and dialogue. Most neural models for these tasks are based on recurrent neural language models (NLMs), which generate sentences from scratch, often in a left-to-right manner (Bengio et al., 2003). It is often observed that such NLMs suffer from the problem of favoring generic utterances such as "I don't know" (Li et al., 2016). At the same time, naive strategies to increase diversity have been shown to compromise grammaticality (Shao et al., 2017), suggesting that current NLMs may lack the inductive bias to faithfully represent the full diversity of complex utterances.

Indeed, it is difficult even for humans to write complex text from scratch in a single pass; we often create an initial draft and incrementally revise it (Hayes and Flower, 1986). Inspired by this process, we propose a new unconditional generative model of text which we call the prototype-then-edit model, illustrated in Figure 1. It first samples a random prototype sentence from the training corpus, and then invokes a neural editor, which draws a random "edit vector" and generates a new sentence by attending to the prototype while conditioning on the edit vector. The motivation is that sentences from the corpus provide a high quality starting point: they are grammatical, naturally diverse, and exhibit no bias towards shortness or vagueness. The attention mechanism (Bahdanau et al., 2015) of the neural editor strongly biases the generation towards the prototype, and therefore it needs to solve a much easier problem than generating from scratch.

We train the neural editor by maximizing an approximation to the generative model's log-likelihood. This objective is a sum over lexically similar sentence pairs in the training set, which we can scalably approximate using locality sensitive hashing.

We also show empirically that most lexically similar sentences are also semantically similar, thereby endowing the neural editor with additional semantic structure. For example, we can use the neural editor to perform a random walk from a seed sentence to traverse semantic space.

We compare our prototype-then-edit model to approaches that generate from scratch on both language generation quality and semantic properties. For the former, our model generates higher quality generations according to human evaluations, and improves perplexity by 13 points on the Yelp corpus and 7 points on the One Billion Word Benchmark. For the latter, we show that latent edit vectors outperform standard sentence variational autoencoders (Bowman et al., 2016) on semantic similarity, locally-controlled text generation, and a sentence analogy task.

2 Problem statement

Our primary goal is to learn a generative model of sentences for use as a language model.[1] In particular, we model sentence generation as a prototype-then-edit process:

1. Select prototype: Given a training corpus of sentences X, randomly sample a prototype sentence x' from a prototype distribution p(x') (in our case, uniform over X).

2. Edit: Sample an edit vector z (encoding the type of edit to be performed) from an edit prior p(z). Then, feed the edit vector z and the previously selected prototype x' into a neural editor p_edit(x | x', z), which generates a new sentence x.

Under this model, the likelihood of a sentence is:

    p(x) = \sum_{x' \in X} p(x \mid x') \, p(x')                                  (1)

    p(x \mid x') = E_{z \sim p(z)} [ \, p_{edit}(x \mid x', z) \, ]               (2)

where both the prototype x' and the edit vector z are latent variables.

[Footnote 1: For many applications such as machine translation or dialogue generation, there is a context (e.g., a foreign sentence or a dialogue history), which can be supplied to both the prototype selector and the neural editor. This paper focuses on the unconditional case, proposing an alternative to LSTM-based language models.]

Our formulation stems from the observation that many sentences in a large corpus can be represented as minor transformations of other sentences. For example, in the Yelp restaurant review corpus (Yelp, 2017) we find that 70% of the test set is within word-token Jaccard distance 0.5 of a training set sentence, even though almost no sentences are repeated verbatim. This implies that a neural editor which models lexically similar sentences should be an effective generative model for large parts of the test set.

A secondary goal for the neural editor is to capture certain semantic properties; we focus on the following two in particular:

1. Semantic smoothness: an edit should be able to alter the semantics of a sentence by a small and well-controlled amount, while multiple edits should make it possible to accumulate a larger change.

2. Consistent edit behavior: the edit vector z should model/control the variation in the type of edit that is performed. When we apply the same edit vector to different sentences, the neural editor should perform semantically analogous edits across the sentences.

In Section 4, we show that the neural editor can successfully capture both properties, as reported by human evaluations.

3 Approach

We would like to train our neural editor p_edit(x | x', z) by maximizing the marginal likelihood (Equation 1) via gradient ascent, but the objective cannot be computed exactly because it involves a sum over all prototypes x' (expensive) and an expectation over the edit prior p(z) (no closed form).

We therefore propose two approximations to overcome these challenges:

1. We lower bound the sum over latent prototypes x' (in Equation 1) by only summing over x' that are lexically similar to x.

2. We lower bound the expectation over the edit prior (in Equation 2) using the evidence lower bound (ELBO) (Jordan et al., 1999; Doersch, 2016), which can be effectively approximated.
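As a concrete illustration of the first approximation, here is a minimal sketch (in Python, with a toy corpus of our own) of the lexical similarity neighborhood N(x) used below in Section 3.1; the full pipeline precomputes these neighborhoods at scale with minhashing and locality sensitive hashing (Appendix 6), which is not shown here.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sentences treated as sets of word tokens."""
    a, b = set(a.split()), set(b.split())
    return 1.0 - len(a & b) / len(a | b)

def lexical_neighborhood(x, corpus, threshold=0.5):
    """N(x) = {x' in corpus : d_J(x, x') < threshold}, computed exactly."""
    return [xp for xp in corpus if xp != x and jaccard_distance(x, xp) < threshold]

# Toy corpus, for illustration only.
corpus = [
    "the food here is ok but not worth the price .",
    "the food is not worth the price .",
    "i definitely recommend this restaurant .",
]
print(lexical_neighborhood("the food here is ok but not worth the price .", corpus))
```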
We describe and motivate these approximations in Sections 3.1 and 3.2, respectively. In Section 3.3, we combine the two approximations to give the final objective. Sections 3.4 and 3.5 drill down further into our specific model architecture.

3.1 Approximate sum on prototypes, x'

Equation 1 defines the probability of generating a sentence x as the total probability of reaching x via edits from every prototype x' \in X. However, most prototypes are unrelated and should have very small probability of transforming into x. Therefore, we approximate the summation over prototypes by only considering prototypes x' that have high lexical overlap with x. To that end, define a lexical similarity neighborhood as:

    N(x) \overset{def}{=} \{ x' \in X : d_J(x, x') < 0.5 \},

where d_J(x, x') is the Jaccard distance between x and x' (treating each as a set of word tokens).

We will now lower bound log p(x) in two ways: (i) we will sum over only prototypes in the neighborhood N(x) rather than over the entire training set X as discussed above; (ii) we will push the log inside the summation using Jensen's inequality, as is standard with variational lower bounds. Recall that the distribution over prototypes is uniform (p(x') = 1/|X|), and define R(x) = \log(|N(x)|/|X|). The derivation is as follows:

    \log p(x) = \log \Big[ \sum_{x' \in X} p(x \mid x') \, p(x') \Big]
              \overset{(i)}{\geq} \log \Big[ \sum_{x' \in N(x)} p(x \mid x') \, p(x') \Big]                     (3)
              = \log \Big[ |N(x)|^{-1} \sum_{x' \in N(x)} p(x \mid x') \Big] + R(x)
              \overset{(ii)}{\geq} \underbrace{ |N(x)|^{-1} \sum_{x' \in N(x)} \log p(x \mid x') }_{\overset{def}{=} \mathrm{LEX}(x)} + R(x).

Assuming the neighborhood size |N(x)| is constant across x, LEX(x) is a lower bound of log p(x) up to constants. For each x, the neighborhood N(x) can be efficiently precomputed with locality sensitive hashing (LSH) and minhashing. The full procedure is described in Appendix 6.

Note that LEX(x) is still intractable to compute because each log p(x | x') term involves an expectation over the edit prior p(z) (Equation 2). We address this in Section 3.2, but first, an interlude.

Interlude: lexical similarity semantics. So far, we have motivated lexical similarity neighborhoods via computational considerations, but we found that lexical similarity training also captures semantic similarity. One can certainly construct sentences with small lexical distance that differ semantically (e.g., insertion of the word "not"). However, since we mine sentences from a corpus grounded in real-world events, most lexically similar sentences are also semantically similar. For example, given "my son enjoyed the delicious pizza", we are far more likely to see "my son enjoyed the delicious macaroni" than "my son hated the delicious pizza".

Human evaluations of 250 edit pairs sampled from lexical similarity neighborhoods on the Yelp corpus support this conclusion. 35.2% of the sentence pairs were judged to be exact paraphrases, while 84% of the pairs were judged to be at least roughly equivalent. Sentence pairs were negated or changed in topic only 7.2% of the time. Thus, a neural editor trained on this distribution should preferentially generate semantically similar edits.

Note that semantic similarity is not needed if we are only interested in modeling the distribution p(x). But it does enable us to learn an edit model p(x | x') that prefers semantically meaningful edits, which we explore in Section 4.3.

3.2 Approximate expectation on edit vectors, z

In Section 3.1, we approximated the marginal likelihood log p(x) by LEX(x), which is a summation over terms of the form:

    \log p(x \mid x') = \log E_{z \sim p(z)} [ \, p_{edit}(x \mid x', z) \, ].                     (4)

Unfortunately the expectation over p(z) has no closed form, and naively approximating it by Monte Carlo sampling z ~ p(z) will have unacceptably high variance, because p_edit(x | x', z) will be almost zero for nearly all z sampled from p(z), while being large for a few important but rare values.
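To see why Equation 4 is hard to estimate directly, the sketch below implements the naive Monte Carlo estimator that samples edit vectors from the prior; as noted above, p_edit(x | x', z) is nearly zero for almost every such sample, so the estimate has very high variance. The log_p_edit and sample_prior callables are assumed stand-ins for a trained editor and the prior, not part of the paper's released code.

```python
import numpy as np

def naive_log_marginal(x, x_prime, log_p_edit, sample_prior, num_samples=1000):
    """Naive Monte Carlo estimate of Equation 4:
    log p(x | x') = log E_{z ~ p(z)}[ p_edit(x | x', z) ]."""
    log_vals = np.array([log_p_edit(x, x_prime, sample_prior())
                         for _ in range(num_samples)])
    # log-mean-exp for numerical stability
    m = log_vals.max()
    return m + np.log(np.mean(np.exp(log_vals - m)))
```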
To address this, we introduce an inverse neural editor q(z | x', x): given a prototype x' and a revised sentence x, it generates edit vectors that are likely to map x' to x, concentrating probability on the few rare but important values of z.

We can then use the evidence lower bound (ELBO) to lower bound Equation 4:

    \log p(x \mid x') \geq \underbrace{ E_{z \sim q(z \mid x', x)} [ \, \log p_{edit}(x \mid x', z) \, ] }_{\mathcal{L}_{gen}} - \underbrace{ \mathrm{KL}( q(z \mid x', x) \,\|\, p(z) ) }_{\mathcal{L}_{KL}} \overset{def}{=} \mathrm{ELBO}(x, x').

Since L_gen is an expectation over q(z | x', x) instead of p(z), it can be effectively Monte Carlo estimated by sampling z ~ q(z | x', x). The second term, L_KL, penalizes the difference between q(z | x', x) and p(z), which is necessary for the lower bound to hold. A thorough introduction to the ELBO is provided in Doersch (2016).

Note that q(z | x', x) and p_edit(x | x', z) combine to form a variational autoencoder (VAE) (Kingma and Welling, 2014), where q(z | x', x) is the variational encoder and p_edit(x | x', z) is the variational decoder.

3.3 Final objective

Combining the lower bounds LEX(x) and ELBO(x, x'), our final approximation of the log-likelihood is

    \sum_{x' \in N(x)} \mathrm{ELBO}(x, x').

We optimize this objective using stochastic gradient ascent with respect to \Theta = (\Theta_p, \Theta_q), where \Theta_p are the parameters for the neural editor and \Theta_q are the parameters for the inverse neural editor.

3.4 Model architecture

To recap, our model features three components: the neural editor p_edit(x | x', z), the edit prior p(z), and the inverse neural editor q(z | x', x). We detail each of these components below.

Neural editor p_edit(x | x', z). We implement our neural editor as a left-to-right sequence-to-sequence model with attention, where the prototype x' is the input sequence and the revised sentence x is the output sequence. We employ an encoder-decoder architecture similar to Wu (2016), extending it to condition on an edit vector z by concatenating z to the input of the decoder at each time step.

The prototype encoder is a 3-layer bidirectional LSTM. The inputs to each layer are the concatenation of the forward and backward hidden states of the previous layer, with the exception of the first layer, which takes word vectors initialized using GloVe (Pennington et al., 2014).

The decoder is a 3-layer LSTM with attention. At each time step, the hidden state of the top layer is used to compute attention over the top-layer hidden states of the prototype encoder. The resulting attention context vector is then concatenated with the decoder's top-layer hidden state and used to compute a softmax distribution over output tokens.

Edit prior p(z). We sample the edit vector z from the prior by first sampling its scalar length z_norm ~ Unif(0, 10) and then sampling its direction z_dir (a unit vector) from the uniform distribution on the unit sphere. The resulting z = z_norm \cdot z_dir. As we will see later, this particular choice of the prior enables us to easily compute L_KL.

Inverse neural editor q(z | x', x). Given an edit pair (x', x), the inverse neural editor must infer what vectors z are likely to map x' to x.

Suppose that x' and x only differed by a single word w. Then one might propose that the edit vector z should be equal to the word vector for w. Generalizing this intuition to multi-word edits, we would like multi-word insertions to be represented as the sum of the inserted word vectors, and similarly for deletions.

Formally, define I = x \setminus x' to be the set of words added to x', and D = x' \setminus x to be the words deleted. We represent the difference between x' and x using the following vector:

    f(x, x') = \sum_{w \in I} \Phi(w) \;\oplus\; \sum_{w \in D} \Phi(w)

where \Phi(w) is the word vector for word w and \oplus denotes concatenation. The word embeddings \Phi are parameters of q. In our work, we initialize \Phi(w) to be 300-dimensional GloVe vectors.
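A minimal sketch of the featurization f(x, x') just defined: sum the embeddings of inserted words, sum the embeddings of deleted words, and concatenate the two. The toy embeddings dictionary below stands in for the 300-dimensional GloVe vectors used in the paper.

```python
import numpy as np

def edit_feature(x_prime, x, embeddings, dim=300):
    """f(x, x') = sum_{w in I} Phi(w) (+) sum_{w in D} Phi(w),
    with I = x \ x' (inserted words), D = x' \ x (deleted words),
    and (+) denoting concatenation."""
    inserted = set(x.split()) - set(x_prime.split())
    deleted = set(x_prime.split()) - set(x.split())
    insert_sum = sum((embeddings[w] for w in inserted if w in embeddings), np.zeros(dim))
    delete_sum = sum((embeddings[w] for w in deleted if w in embeddings), np.zeros(dim))
    return np.concatenate([insert_sum, delete_sum])  # 2 * dim dimensions

# Toy stand-in for GloVe: random vectors keyed by word.
embeddings = {w: np.random.randn(300) for w in "good great the food was".split()}
f = edit_feature("the food was good", "the food was great", embeddings)
```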
Since we construct our edit vectors as the sum of word vectors, and similarities between word vectors have traditionally been measured with cosine similarity, we design q to add noise that perturbs the direction of the vector f. In particular, a sample from q is simply a perturbed version of f: we perturb the direction of f by adding von-Mises Fisher (vMF) noise, and we perturb the magnitude of f by adding uniform noise. We visualize this perturbation process in Figure 2.

[Figure 2: The inverse neural editor q outputs a perturbed version of f(x, x'). The perturbation process is a random rotation (according to the von-Mises Fisher distribution) followed by a random rescaling (according to the uniform distribution).]

Formally, let f_norm = \|f\| and f_dir = f / f_norm. Let vMF(v; \mu, \kappa) denote a vMF distribution over points v on the unit sphere (i.e., directions) with mean vector \mu and concentration parameter \kappa (in such a distribution, the log-likelihood of a point decays linearly with its cosine similarity to \mu, and the rate of decay is controlled by \kappa). Finally, define:

    q(z_{dir} \mid x', x) = \mathrm{vMF}(z_{dir}; f_{dir}, \kappa)
    q(z_{norm} \mid x', x) = \mathrm{Unif}(z_{norm}; [\tilde{f}_{norm}, \tilde{f}_{norm} + \epsilon])

where \tilde{f}_{norm} = \min(f_{norm}, 10 - \epsilon) is the truncated norm. The resulting edit vector is z = z_dir \cdot z_norm.

The inverse neural editor q is parameterized by the word vectors \Phi and has hyperparameters \kappa and \epsilon. Further details are provided in Section 3.5.

3.5 Details of the inverse neural editor

Differentiating w.r.t. \Theta_q. To maximize our training objective, we must be able to compute \nabla_{\Theta_q} ELBO(x, x') = \nabla_{\Theta_q} L_gen - \nabla_{\Theta_q} L_KL.

To compute \nabla_{\Theta_q} L_gen, we use a reparameterization trick. Specifically, we can rewrite z ~ q(z | x', x) as z = h(\alpha), where h is a deterministic function differentiable with respect to \Theta_q and \alpha ~ p(\alpha) is an auxiliary random variable not depending on \Theta_q (the details of h and \alpha are given in Appendix 6). We can then write:

    \nabla_{\Theta_q} \mathcal{L}_{gen} = \nabla_{\Theta_q} E_{z \sim q(z \mid x', x)} [ \, \log p_{edit}(x \mid x', z) \, ]
                                        = E_{\alpha \sim p(\alpha)} [ \, \nabla_{\Theta_q} \log p_{edit}(x \mid x', h(\alpha)) \, ].

This moves the derivative inside the expectation. The inner derivative can now be computed via standard backpropagation.

Next, we turn to \nabla_{\Theta_q} L_KL. First, note that:

    \mathcal{L}_{KL} = \mathrm{KL}( q(z_{norm} \mid x', x) \,\|\, p(z_{norm}) ) + \mathrm{KL}( q(z_{dir} \mid x', x) \,\|\, p(z_{dir}) ).                     (5)

It is easy to verify that the first KL term does not depend on \Theta_q. The second term has the closed form

    \mathrm{KL}( \mathrm{vMF}(\mu, \kappa) \,\|\, \mathrm{vMF}(\mu, 0) ) = \kappa \, \frac{ I_{d/2}(\kappa) + I_{d/2-1}(\kappa) \frac{d-2}{2\kappa} }{ I_{d/2-1}(\kappa) } - \frac{d-2}{2\kappa}
        - \log( I_{d/2-1}(\kappa) ) - \log( \Gamma(d/2) ) + \log(\kappa)(d/2 - 1) - (d - 2) \log(2)/2,                     (6)

where I_n(\kappa) is the modified Bessel function of the first kind, \Gamma is the gamma function, and d is the dimensionality of f. We can see that this too is constant with respect to \Theta_q via the following intuition: both the KL divergence and the prior do not change under rotations, and thus KL(vMF(\mu, \kappa) \| vMF(\mu, 0)) = KL(vMF(e_1, \kappa) \| vMF(e_1, 0)) by rotating \mu to the first canonical basis vector. Hence \nabla_{\Theta_q} L_KL = 0.

Comparison with existing VAE encoders. Our design of q differs from the typical choice of a standard normal distribution (Bowman et al., 2016; Kingma and Welling, 2014) for two reasons:

First, by construction, edit vectors are sums of word vectors, and since cosine distances are traditionally used to measure distances between word vectors, it is natural to encode distances between edit vectors by the cosine distance. The von-Mises Fisher distribution captures this idea, as the log-likelihood decays with cosine similarity.

Second, our design of q allows us to explicitly control the tradeoff between the two terms in our objective, L_gen and L_KL. Note from Equations 5 and 6 that L_KL is purely a function of the hyperparameters \epsilon and \kappa, and can thus be controlled exactly. By taking \kappa \to 0 and \epsilon to the maximum norm, we can drive L_KL arbitrarily close to 0.
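Since L_KL depends only on (kappa, epsilon) and the dimensionality d, it can be evaluated in closed form and tuned directly. The sketch below evaluates the vMF term exactly as Equation 6 is printed above, using SciPy's modified Bessel function; d = 600 corresponds to concatenating the two 300-dimensional halves of f, and the helper name is ours, not the paper's.

```python
import numpy as np
from scipy.special import iv, gammaln  # modified Bessel I_v and log-Gamma

def vmf_kl_to_uniform(kappa, d):
    """KL( vMF(mu, kappa) || vMF(mu, 0) ), following Equation 6 as printed.
    The value does not depend on mu (or on Theta_q), which is why
    grad_{Theta_q} L_KL = 0."""
    if kappa == 0.0:
        return 0.0
    ratio = (iv(d / 2, kappa) + iv(d / 2 - 1, kappa) * (d - 2) / (2 * kappa)) / iv(d / 2 - 1, kappa)
    return (kappa * ratio
            - (d - 2) / (2 * kappa)
            - np.log(iv(d / 2 - 1, kappa))
            - gammaln(d / 2)
            + np.log(kappa) * (d / 2 - 1)
            - (d - 2) * np.log(2) / 2)

# For very large d or very small kappa, iv(...) can underflow; a log-space
# Bessel evaluation would be needed there. This suffices as an illustration.
print(vmf_kl_to_uniform(kappa=25.0, d=600))
```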
As a tradeoff, smaller values of \kappa produce a noisier edit vector, leading to a smaller L_gen. We find a good balance by tuning \kappa.

In contrast, when using a Gaussian variational encoder, the KL term takes a different value per example and cannot be explicitly controlled. Consequently, Bowman et al. (2016) and others have observed that training tends to aggressively drive these KL terms to zero, leading to uninformative values of z — even when multiplying L_KL by a carefully tuned and annealed importance weight.

4 Experiments

We divide our experimental results into two parts. In Section 4.2, we evaluate the merits of the prototype-then-edit model as a generative modeling strategy, measuring its improvements on language modeling (perplexity) and generation quality (human evaluations of diversity and plausibility). In Section 4.3, we focus on the semantics learned by the model and its latent edit vector space. We demonstrate that it possesses interpretable semantics, enabling us to smoothly control the magnitude of edits, incrementally optimize sentences for target properties, and perform analogy-style sentence transformations.

4.1 Datasets

We evaluate perplexity on the Yelp review corpus (Yelp, 2017) and the One Billion Word Language Model Benchmark (BillionWord; Chelba, 2013). For qualitative evaluations of generation quality and semantics, we focus on Yelp as our primary test case, as we found that human judgments of semantic similarity were much better calibrated in this focused setting.

For both corpora, we used the named-entity recognizer (NER) in spaCy[2] to replace named entities with their NER categories. We replaced tokens outside the top 10,000 most frequent tokens with an "out-of-vocabulary" token.

[Footnote 2: honnibal.github.io/spaCy]

4.2 Generative modeling

We compare NeuralEditor as a language model against the following baseline language models:

1. NLM: a standard left-to-right neural language model generating from scratch. For fair comparison, we use the exact same architecture as the decoder of NeuralEditor.

2. KN5: a standard 5-gram Kneser-Ney language model in KenLM (Heafield et al., 2013).

3. Memorization: generates by sampling a sentence from the training set.

Perplexity. We start by evaluating NeuralEditor's value as a language model, measured in terms of perplexity. We use the likelihood lower bound in Equation 3, where we sum over training set instances within Jaccard distance < 0.5, and for the VAE term in NeuralEditor, we use the one-sample approximation to the lower bound used in Kingma (2014) and Bowman (2016).

To evaluate NeuralEditor's perplexity, we use linear smoothing with the NLM to account for rare sentences not within our Jaccard distance threshold. This smoothing corresponds to occasionally sampling a special prototype sentence that can be edited into any other sentence, and we use a smoothing weight of 0.1 (for full details, see Appendix 6). We find NeuralEditor improves perplexity over NLM and KN5. Table 1 shows that this is the case for both Yelp and the more general BillionWord, which contains substantially fewer test-set sentences close to the training set. On Yelp, we surpass even the best ensemble of NLM and KN5, while on BillionWord we nearly match their performance.

Comparing each model at a per-sentence level, we see that NeuralEditor drastically improves log-likelihood for a significant number of sentences in the test set (Figure 3). Proximity to a prototype seems to be the chief determiner of NeuralEditor's performance.

Model | Perplexity (Yelp) | Perplexity (BillionWord)
KN5 | 56.546 | 78.361
KN5 + Memorization | 55.184 | 73.468
NLM | 39.026 | 55.146
NLM + Memorization | 38.086 | 50.969
NLM + KN5 | 37.312 | 47.472
NeuralEditor (\kappa = 0) | 26.87 | 48.755
NeuralEditor (\kappa = 25) | 27.41 | 48.921

Table 1: Perplexity of NeuralEditor with two settings of the VAE parameter \kappa. NeuralEditor outperforms all methods on Yelp and all non-ensemble methods on BillionWord.
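For reference, a minimal sketch of how the Equation 3 bound and per-sentence log-likelihoods turn into reported perplexities; in practice each log p(x | x') would itself be a one-sample ELBO estimate, and the smoothing of Appendix 6 is omitted here. Function names are ours.

```python
import math

def lex_lower_bound(logp_x_given_prototypes, corpus_size):
    """LEX(x) + R(x): lower bound on log p(x) from Section 3.1, where
    `logp_x_given_prototypes` holds log p(x | x') for each x' in N(x)."""
    n = len(logp_x_given_prototypes)
    r_x = math.log(n / corpus_size)          # R(x) = log(|N(x)| / |X|)
    return sum(logp_x_given_prototypes) / n + r_x

def perplexity(per_sentence_log_probs, total_tokens):
    """Corpus perplexity from per-sentence log-probabilities (natural log)."""
    return math.exp(-sum(per_sentence_log_probs) / total_tokens)
```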
[Figure 3: NeuralEditor outperforms NLM on examples similar to those in the training set (left panel, point size indicates number of training set examples with Jaccard distance < 0.5). The N-gram baseline (right) shows no such behavior, with NLM outperforming KN5 on most examples.]

Prototype x' | Revision x
this place gets <cardinal> stars for its diversity in its menu . | this place gets <cardinal> stars although not for the prices .
great food and the happy hour deals were out of this world . | the deals are great and the food is out of this world .
i've been going to <person> for <date> and i used to really like this place . | i've been going to this place for <date> now and love it .
their food is great , and you can't beat the price . | you can't beat the service and food here .

Table 2: Edited generations are substantially different from the sampled prototypes.

Since NeuralEditor draws its strength from sentences in the training set, we also compared against a simpler alternative, in which we ensemble NLM and Memorization (retrieval without edits). NeuralEditor performs dramatically better than this alternative. Table 2 also qualitatively demonstrates that sentences generated by NeuralEditor are substantially different from the original prototypes.

Human evaluation. We now turn to human evaluation of generation quality, focusing on grammaticality and plausibility. We evaluated plausibility by asking human raters, "How plausible is it for this sentence to appear in the corpus?" on a scale of 1–3. We evaluate generations from NeuralEditor against an NLM with a temperature parameter on the per-token softmax,[3] as well as a baseline which generates sentences by randomly sampling from the training set and replacing synonyms, where the probability of substitution follows exp(s_ij / \tau), with s_ij the cosine similarity between the original word and its synonym according to GloVe word vectors.

[Footnote 3: If s_i is the softmax logit for token w_i and \tau is a temperature parameter, the temperature-adjusted distribution is p(w_i) \propto \exp(s_i / \tau).]

Decreasing the temperature parameter below 1 is a popular technique for suppressing incoherent and ungrammatical sentences. Many NLM systems have noted an undesirable tradeoff between grammaticality and diversity, where a temperature low enough to enforce grammaticality results in short and generic utterances (Li et al., 2016).

Figure 4 illustrates that both the grammaticality and plausibility of NeuralEditor without any temperature annealing are on par with the best tuned temperature for the NLM, with far higher diversity, as measured by the discrete entropy over unigram frequencies. We also find that decreasing the temperature of NeuralEditor can be used to slightly improve grammaticality, without substantially reducing the diversity of the generations.

Comparing with the synonym substitution model, we find both models have high plausibility, since synonym substitution maintains most of the words, but low grammaticality compared to both NeuralEditor and the NLM. Additionally, applying synonym substitutions to training examples has extremely low coverage: none of the sentences in the test set can be generated via synonym substitution, and thus this baseline has higher perplexity than all other baselines in Table 1.

A key advantage of edit-based models thus emerges: prototypes sampled from the training set organically inject diversity into the generation process, even if the temperature of the decoder in NeuralEditor is zero. Hence, we can keep the decoder at a very low temperature to maximize grammaticality and plausibility, without sacrificing diversity. In contrast, a zero-temperature NLM would collapse to outputting one generic sentence.

This also suggests that the temperature parameter for NeuralEditor captures a more natural notion of diversity — a temperature of 1.0 encourages more aggressive extrapolation from the training set, while lower temperatures favor more conservative mimicking. This is likely to be more useful than the tradeoff for generation-from-scratch, where low temperature also affects the diversity of generations.
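For reference, footnote 3's temperature-adjusted sampling in code form (a minimal sketch; the function name is ours):

```python
import numpy as np

def sample_token(logits, tau=1.0, rng=np.random.default_rng(0)):
    """Sample a token index from p(w_i) proportional to exp(s_i / tau) (footnote 3).
    Lower tau favors high-probability tokens; tau -> 0 approaches greedy decoding."""
    scaled = np.asarray(logits, dtype=float) / tau
    scaled -= scaled.max()          # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```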
[Figure 4: NeuralEditor provides plausibility and grammaticality on par with the best, temperature-tuned language model without any loss of diversity as a function of temperature. Results are based on 400 human evaluations.]

Categorizing edits. To better understand the behavior of NeuralEditor, we measured the frequency with which random edits from NeuralEditor matched known syntactic transformations. We use the rule-based transformations defined in He (2015) as our set of transformations to test, and search the corpus for sentences where these rules can be applied. We then apply the rule-based transformation, and measure the log-likelihood that NeuralEditor generates the transformed outputs. We find that the edit model assigns relatively high probability to the identity map (no edits), followed by simple reordering transformations such as reordering of to/that clauses (It is ADJP to/that SBAR/S → To SBAR/S is ADJP). Of the rules, active/passive receives the lowest probability, partially due to the rarity of passive-voice sentences in the Yelp corpus (Table 3).

In all cases, the model assigns substantially higher probability to these rule-based transformations than to editing to random sentences or shuffling the tokens randomly to match the Levenshtein distance of each rule-based transform.

4.3 Semantics of NeuralEditor

In this section, we investigate the learned semantics of NeuralEditor, focusing on the two desiderata discussed in Section 2: semantic smoothness and consistent edit behavior.

In order to establish a baseline for these properties, we consider existing sentence generation techniques which can sample semantically similar sentences. The most similar language modeling approach which can capture semantics is the sentence variational autoencoder (SVAE), which imposes semantic structure onto a latent vector space, but uses the latent vector to represent the entire sentence, rather than just an edit. To use the SVAE to "edit" a target sentence into a semantically similar sentence, we perturb its underlying latent sentence vector and then decode the result back into a sentence — the same method used in Bowman et al. (2016).

Semantic smoothness. A good editing system should have fine-grained control over the semantics of a sentence: i.e., each edit should only alter the semantics of a sentence by a small and well-controlled amount. We call this property semantic smoothness.

To study smoothness, we first generate an "edit sequence" by randomly selecting a prototype sentence, and then repeatedly editing via NeuralEditor (with edits drawn from the edit prior p(z)) to produce a sequence of revisions. We then ask human annotators to rate the size of the semantic changes between revisions. An example is given in Table 4.

We compare to two baselines: one based upon the sentence variational autoencoder (SVAE) and another which simply samples similar sentences from the training set according to average word vector similarity (Cosine).

For SVAE, we generate a similar sequence of sentences by first encoding the prototype sentence, and then decoding after the addition of random Gaussian noise with variance 0.4.[4] This process is repeated to produce a sequence of sentences which we can view as the SVAE equivalent of the edit sequence.

[Footnote 4: The variance was selected so that SVAE and NeuralEditor have the same average human similarity judgement between two successive sentences. This avoids situations where SVAE produces completely unrelated sentences due to the perturbation size.]

For Cosine, we generate sentences from the training set using exponentiated cosine similarity between averaged word vectors. The temperature parameter for the exponential was selected as before to match the average human similarity judgement.
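A minimal sketch of the edit-sequence generation described above, with neural_editor and edit_prior as assumed stand-ins for the trained decoder p_edit and the prior p(z):

```python
def edit_sequence(prototype, neural_editor, edit_prior, num_steps=5):
    """Random walk in sentence space: repeatedly edit the latest revision
    with an edit vector drawn from the prior (Section 4.3)."""
    revisions = [prototype]
    for _ in range(num_steps):
        z = edit_prior.sample()                        # z = z_norm * z_dir
        revisions.append(neural_editor.decode(revisions[-1], z))
    return revisions
```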
Type of edit | -log(p) per token | Example | Transformed example
Identity | 0.33 ± 0.006 | It is important to remain watchful. | It is important to remain watchful.
to Clause reordering | 1.62 ± 0.156 | It is important to remain watchful. | To remain watchful is important.
Quotative verb reordering | 2.02 ± 0.0359 | They announced that the president will restructure the division. | The president will restructure the division, they announced.
Conjunction reversal | 2.52 ± 0.0520 | We should march because winter is coming. | Winter is coming, because of this, we should march.
Genitive reordering | 2.678 ± 0.0477 | The best restaurant of New York. | New York's best restaurant.
Active / passive | 3.271 ± 0.0298 | The talk was denied by the boycott group spokesman. | The boycott group spokesman denied the talk.
Random sentence reordering | 4.42 ± 0.026 | It is important to remain watchful. | It remain is to important watchful.
Editing to random sentences | 6.068 ± 0.084 | |

Table 3: NeuralEditor assigns high probabilities to the syntactic transformations defined in He (2015) compared to baselines of editing to random sentences or randomly reordering tokens to match the Levenshtein distance of a syntactic edit. Small transformations, such as clause reordering, receive higher probability than more structural edits such as changing from active to passive voice.

NeuralEditor | SVAE
this food was amazing one of the best i've tried, service was fast and great. | this food was amazing one of the best i've tried, service was fast and great.
this is the best food and the best service i've tried in <gpe>. | this place is a great place to go if you want a quick bite.
some of the best <norp> food i've had in <date> i've lived in <gpe>. | the food was good, but the service was terrible.
i have to say this is the best <norp> food i've had in <gpe>. | this is the best <norp> food in <gpe>.
best <norp> food i've had since moving to <gpe> <date>. | this place is a great place to go if you want to eat.
this was some of the best <norp> food i've had in the <gpe>. | this is the best <norp> food in <gpe>.

Table 4: Example random walks from NeuralEditor and the SVAE, where the top sentence is the prototype.

Figure 5 shows that NeuralEditor frequently generates paraphrases despite being trained on lexical similarity, and only 1% of edits are unrelated to the prototype. In contrast, SVAE often repeats sentences exactly, and when it makes an edit it is equally likely to generate unrelated sentences. Cosine performs even worse, likely due to the difficulty of retrieving similar sentences for rare and long sentences.

[Figure 5: Compared with baselines, NeuralEditor frequently generates paraphrases and similar sentences while avoiding unrelated and degenerate ones.[6]]

[Footnote 6: 545 similarity assessment pairs were collected through Amazon Mechanical Turk following Agirre (2014), with the same scale and prompt. Similarity judgements were converted to descriptions by defining Paraphrase (5), Roughly Equivalent (4-3), Same Topic (2-1), Unrelated (0).]

Qualitatively (Table 4), NeuralEditor seems to generate long, diverse sentences which smoothly change over time, while the SVAE biases towards short sentences with several semantic jumps, presumably due to the difficulty of training a sufficiently informative SVAE encoder.

Smoothly controlling sentences. We now show that we can selectively choose edits sampled from NeuralEditor to incrementally optimize a sentence towards desired attributes. This task serves as a useful measure of semantic coverage: if an edit model has high coverage over sentences that are semantically similar to a prototype, it should be able to satisfy the target attribute while deviating minimally from the prototype's original meaning.

We focus on controlling two simple attributes: compressing a sentence to below a desired length (e.g., 7 words), and inserting a target keyword into the sentence (e.g., "service" or "pizza").
[Figure 6: NeuralEditor can shorten sentences (left), include common words (center, the word 'service') and rarer words (right, 'pizza') while maintaining similarity.]

NeuralEditor | SVAE
the coffee ice cream was one of the best i've ever tried. | the coffee ice cream was one of the best i've ever tried.
some of the best ice cream we've ever had! | the <unk> was very good and the food was good.
just had the best ice - cream i've ever had! | the food was good, but not great.
some of the best pizza i've ever tasted! | the food was good, but not great.
that was some of the best pizza i've had in the area. | the food was good, but the service was n't bad.

Table 5: Examples of word inclusion trajectories for 'pizza'. NeuralEditor produces smooth chains that lead to word inclusion, but the SVAE gets stuck on generic sentences.

Given a prototype sentence, we try to discover a semantically similar sentence satisfying the target attribute using the following procedure: First, we generate 1,000 edit sequences using the procedure described earlier. Then, we select the sequence with highest likelihood whose endpoint possesses the target attribute. We repeat this process for a large number of prototypes.

We use almost the same procedure for the SVAE, but instead of selecting by highest likelihood, we select the sequence whose endpoint has the shortest latent vector distance from the prototype (as this is the SVAE's metric of semantic similarity).

In Figure 6, we then aggregate the sentences from the collected edit sequences, and plot their semantic similarity to the prototype against their success in satisfying the target attribute. Not surprisingly, as target attribute satisfaction rises, semantic similarity drops. However, we also see that NeuralEditor sacrifices less semantic similarity to achieve the same level of attribute satisfaction as SVAE. SVAE is reasonable on tasks involving common words (such as the word service), but fails when the model is asked to generate rarer words such as pizza. Examples from these word inclusion problems show that SVAE often becomes stuck generating short, generic sentences (Table 5).

Consistent edit behavior: sentence analogies. In the previous results, we showed that edit models learn to generate semantically similar sentences. We now assess whether the edit vector possesses globally consistent semantics. Specifically, applying the same edit vector to different sentences should result in semantically analogous edits.

For example, suppose we have an edit vector which edits the sentence x1 = "this was a good restaurant" into x2 = "this was the best restaurant". Given a new sentence y1 = "The cake was great", we expect applying the same edit vector to result in y2 = "The cake was the greatest".

Formally, suppose we have two sentences, x1 and x2, which are related by some underlying semantic relation r. Given a new sentence y1, we would like to find a y2 such that the same relation r holds between y1 and y2.

Our approach is to estimate the edit vector between x1 and x2 as ẑ = f(x1, x2) — the mode of the inverse neural editor q. We then apply this edit vector to y1 using the neural editor to yield ŷ2 = argmax_x p_edit(x | y1, ẑ).

Since it is difficult to output ŷ2 exactly matching y2, we take the top k candidate outputs of p_edit (using beam search) and evaluate whether the gold y2 appears among the top k elements.

We generate the semantic relations r using prior evaluations for word analogies (Mikolov et al., 2013a; Mikolov et al., 2013b). We leverage these to generate a new dataset of sentence analogies, using a simple strategy: given an analogous word pair (w1, w2), we mine the Yelp corpus for sentence pairs (x1, x2) such that x1 is transformed into x2 by inserting w1 and removing w2 (allowing for reordering and inclusion/exclusion of stop words).
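A minimal sketch of this analogy procedure, reusing the edit_feature sketch from Section 3.4 as the mode of q and assuming a beam_decode interface for p_edit (neither name is from the paper):

```python
def sentence_analogy(x1, x2, y1, embeddings, neural_editor, k=10):
    """Apply the x1 -> x2 edit to a new sentence y1 (Section 4.3).
    Success is measured by whether the gold y2 appears among the top-k
    beam-search candidates."""
    z_hat = edit_feature(x1, x2, embeddings)            # edit vector for x1 -> x2
    return neural_editor.beam_decode(y1, z_hat, k=k)    # top-k candidates for y2
```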
Method | Google: gram4-superlative | Google: gram3-comparative | Google: family | Microsoft: JJR_JJS | Microsoft: VB_VBD | Microsoft: VBD_VBZ | Microsoft: NN_NNS | Microsoft: VB_VBZ | Microsoft: JJ_JJR
GloVe | 0.45 | 0.85 | 1.0 | 0.75 | 0.63 | 0.82 | 0.82 | 0.61 | 0.77
Edit vector (top 10) | 0.75 | 0.75 | 0.29 | 0.79 | 0.57 | 0.60 | 0.58 | 0.41 | 0.24
Edit vector (top 1) | 0.60 | 0.32 | 0.01 | 0.45 | 0.16 | 0.23 | 0.17 | 0.01 | 0.06
Sampling (top 10) | 0.10 | 0.10 | 0.09 | 0.10 | 0.08 | 0.14 | 0.15 | 0.05 | 0.03

Table 6: Edit vectors capture one-word sentence analogies with performance close to lexical analogies.

 | Example 1 | Example 2
Context | he comes home tired and happy . | i went with a larger group to <person> 's .
Edit | + was  - is | + good  - better
Result | = he came home happy and tired . | = i went to <person> 's with a large group .

Table 7: Examples of lexical analogies correctly answered by NeuralEditor. Sentence pairs generating the analogy relationship are shortened to only their lexical differences.

For this task, we initially compared against the SVAE, but it had a top-k accuracy close to zero. Hence, we instead compare to Sampling, a baseline which randomly samples an edit vector ẑ ~ p(z) instead of using the ẑ derived from f(x1, x2).

We also compare our accuracies to the simpler task of solving word-level, rather than sentence-level, analogies from Mikolov et al. (2013a) using GloVe. This task is substantially simpler, since the goal is to identify a single word (such as "good:better::bad:?") instead of an entire sentence. Despite this, the top-10 performance of our model in Table 6 is nearly as good as the performance of GloVe vectors on the simpler lexical analogy task. In some categories, NeuralEditor at top-10 actually performs better than word vectors, since NeuralEditor has an understanding of which words are likely to appear in the context of a Yelp review. Examples in Table 7 show the model is accurate and captures lexical analogies requiring word reorderings.

5 Related work and discussion

Our work connects with a broad literature on attention-based neural models, retrieval-augmented text generation, semantically meaningful representations, and nonparametric statistics.

Based upon recurrent neural networks and sequence-to-sequence architectures (Sutskever et al., 2014), neural language models (Bengio et al., 2003) have been widely used due to their flexibility and performance across a wide range of NLP tasks (Kalchbrenner and Blunsom, 2013; Hahn and Mani, 2000; Ritter et al., 2011). Our work is motivated by an emerging consensus that attention-based mechanisms (Bahdanau et al., 2015) can substantially improve performance on various sequence-to-sequence tasks by capturing more information from the input sequence (Vaswani et al., 2017). Our work extends the applicability of attention mechanisms beyond sequence-to-sequence models by allowing models to attend to randomly sampled sentences.

There is a growing literature on applying retrieval mechanisms to augment text generation models. For example, in the image captioning literature, Hodosh (2013), Kuznetsova (2013) and Mason (2014) proposed to generate image captions by first retrieving a prototype caption based on an image context, and then applying sentence compression to tailor the prototype to a particular image. More recently, Song (2016) ensembled a retrieval system and an NLM for dialogue, using the NLM to transform the retrieved utterance, and Gu (2017) used an off-the-shelf search engine to retrieve and condition on training set examples. Although these approaches also edit text from the training set, they solve a fundamentally different problem: they address conditional generation and retrieve prototypes based on a context, whereas our task is unconditional, so there is no context which we can use to retrieve.

Our work treats the prototype x' as a latent variable rather than having it given by a retrieval mechanism, and marginalizes over all possible prototypes — a challenge which motivates our new lexical similarity training method in Section 3.1.
Practically, marginalization over x' makes our model attend to training examples based on similarity of output sequences, while prior retrieval models attend to examples based on similarity of the input sequences.

In terms of generation techniques that capture semantics, the sentence variational autoencoder (SVAE) (Bowman et al., 2016) is closest to our work in that it attempts to impose semantic structure on a latent vector space. However, the SVAE's latent vector is meant to represent the entire sentence, whereas the neural editor's latent vector represents an edit. Our results from Section 4.3 suggest that local variation over edits is easier to model than global variation over sentences.

Our use of lexical similarity neighborhoods is comparable to context windows in word vector training (Mikolov et al., 2013a). More generally, results in manifold learning demonstrate that a weak metric such as lexical similarity can be used to extract semantic similarity through distributional statistics (Tenenbaum et al., 2000; Hashimoto et al., 2016).

From a generative modeling perspective, editing randomly sampled training sentences closely resembles nonparametric kernel density estimation (Parzen, 1962), where one samples points from a training set and adds noise to smooth the density. Our edit model is the text equivalent of Gaussian noise, and our training mechanism is a type of learned smoothing kernel.

Prototype-then-edit is a semi-parametric approach that remembers the entire training set and uses a neural editor to generalize meaningfully beyond the training set. The training set provides a strong inductive bias — that the corpus can be characterized by prototypes surrounded by semantically similar sentences reachable by edits. Beyond improvements on generation quality as measured by perplexity, the approach also reveals new semantic structures via the edit vector.

Reproducibility. All code, data and experiments are available on the CodaLab platform at https://bit.ly/2rHsWAX.

Acknowledgements. We thank the reviewers and editor for their insightful comments. This work was funded by the DARPA CwC program under ARO prime contract no. W911NF-15-1-0462.

6 Appendix

Construction of the LSH. The LSH maps a sentence to lexically similar sentences in the corpus, representing a graph over sentences. We apply breadth-first search (BFS) over the LSH sentence graph, started at randomly selected seed sentences, and uniformly sample this set to form the training set.

Reparameterization trick for q. First, note that we can write z_norm ~ q(z_norm | x', x) as z_norm = h_norm(\alpha_norm) \overset{def}{=} \tilde{f}_norm + \alpha_norm, where \alpha_norm ~ Unif(0, \epsilon). Furthermore, Wood (1994) presents a function h_dir and an auxiliary random variable \alpha_dir such that z_dir = h_dir(\alpha_dir) is distributed according to a vMF with mean f and concentration \kappa. We can then define z = h(\alpha) \overset{def}{=} h_dir(\alpha_dir) \cdot h_norm(\alpha_norm).

[Figure 7: Small amounts of smoothing are sufficient to make NeuralEditor outperform the baseline NLM.]

Smoothing for language models. As a language model, NeuralEditor does not place probability on any test sentence which is sufficiently dissimilar from all training set sentences. In order to avoid this problem, we can consider a special prototype sentence '∅' which can be edited into any sentence, and draw this special prototype with probability p_∅. Concretely, we write:

    p(x) = \sum_{x' \in X \cup \{\emptyset\}} p_{edit}(x \mid x') \, p_{prior}(x')
         = (1 - p_\emptyset) \sum_{x' \in X} \frac{1}{|X|} \, p_{edit}(x \mid x') + p_\emptyset \, p_{NLM}(x).

This linearly smoothes between our edit model (p_edit) and the NLM (p_NLM): since our decoder is identical to the NLM, conditioning on the special ∅ token reduces to using an NLM.

Empirically, we observe that even small values of p_∅ produce low perplexity (Figure 7), corresponding to the observation that smoothing of NeuralEditor is only necessary to avoid degenerate log-likelihoods on a very small subset of the test set.
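A minimal sketch of this linear smoothing, with log_p_edit and log_p_nlm as assumed stand-ins for the trained edit model and NLM; only prototypes in the precomputed lexical neighborhood of x contribute non-negligibly to the sum:

```python
import math

def smoothed_log_prob(x, neighborhood, corpus_size, log_p_edit, log_p_nlm, p_null=0.1):
    """p(x) = (1 - p_null) * (1/|X|) * sum_{x' in X} p_edit(x | x') + p_null * p_NLM(x),
    restricting the sum to the lexical neighborhood of x (other terms are negligible)."""
    edit_mass = sum(math.exp(log_p_edit(x, xp)) for xp in neighborhood) / corpus_size
    return math.log((1.0 - p_null) * edit_mass + p_null * math.exp(log_p_nlm(x)))
```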
References

E. Agirre, C. Banea, C. Cardie, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe. 2014. SemEval-2014 Task 10: Multilingual semantic textual similarity. In International Conference on Computational Linguistics (COLING), pages 81–91.

D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(0):1137–1155.

S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. 2016. Generating sentences from a continuous space. In Computational Natural Language Learning (CoNLL), pages 10–21.

C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

C. Doersch. 2016. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.

J. Gu, Y. Wang, K. Cho, and V. O. Li. 2017. Search engine guided non-parametric neural machine translation. arXiv preprint arXiv:1705.07267.

U. Hahn and I. Mani. 2000. The challenges of automatic summarization. Computer, 33.

T. B. Hashimoto, D. Alvarez-Melis, and T. S. Jaakkola. 2016. Word embeddings as metric recovery in semantic spaces. Transactions of the Association for Computational Linguistics (TACL), 4:273–286.

J. R. Hayes and L. S. Flower. 1986. Writing research and the writer. American Psychologist, 41(10):1106–1113.

H. He, A. G. II, J. Boyd-Graber, and H. D. III. 2015. Syntax-based rewriting for simultaneous machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 55–64.

K. Heafield, I. Pouzyrevsky, J. H. Clark, and P. Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Association for Computational Linguistics (ACL), pages 690–696.

M. Hodosh, P. Young, and J. Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research (JAIR), 47:853–899.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37:183–233.

N. Kalchbrenner and P. Blunsom. 2013. Recurrent continuous translation models. In Empirical Methods in Natural Language Processing (EMNLP), pages 1700–1709.

D. P. Kingma and M. Welling. 2014. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. 2013. Generalizing image captions for image-text parallel corpus. In Association for Computational Linguistics (ACL), pages 790–796.

J. Li, M. Galley, C. Brockett, J. Gao, and W. B. Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL), pages 110–119.

R. Mason and E. Charniak. 2014. Domain-specific image captioning. In Computational Natural Language Learning (CoNLL), pages 2–10.

T. Mikolov, K. Chen, G. Corrado, and J. Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

T. Mikolov, W. Yih, and G. Zweig. 2013b. Linguistic regularities in continuous space word representations. In Human Language Technology and North American Association for Computational Linguistics (HLT/NAACL), volume 13, pages 746–751.

E. Parzen. 1962. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065–1076.

J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

A. Ritter, C. Cherry, and W. B. Dolan. 2011. Data-driven response generation in social media. In Empirical Methods in Natural Language Processing (EMNLP), pages 583–593.

L. Shao, S. Gouws, D. Britz, A. Goldie, B. Strope, and R. Kurzweil. 2017. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Empirical Methods in Natural Language Processing (EMNLP), pages 2210–2219.

Y. Song, R. Yan, X. Li, D. Zhao, and M. Zhang. 2016. Two are better than one: An ensemble of retrieval- and generation-based dialog systems. arXiv preprint arXiv:1610.07149.

I. Sutskever, O. Vinyals, and Q. V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112.

J. B. Tenenbaum, V. D. Silva, and J. C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science, pages 2319–2323.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

A. T. Wood. 1994. Simulation of the von Mises Fisher distribution. Communications in Statistics - Simulation and Computation, pages 157–164.

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Yelp. 2017. Yelp Dataset Challenge, Round 8. https://www.yelp.com/dataset_challenge.
