
Unsupervised Style Transfer for Text

Robert Gordan Mingu Kim

Abstract

Sequence to sequence translation is a fundamental problem in natural language processing that aims to learn a mapping between two sets of sequences. Research on this problem has largely focused on supervised contexts, where examples in each set of sequences are paired with each other. Machine translation is one area where large quantities of supervised data exist for some language pairs. In this paper we explore approaches to the unsupervised analog of the sequence to sequence mapping problem, where we assume that a satisfactory mapping exists and then attempt to learn this mapping. We explore several approaches to this problem as applied to transfer between different styles in the same language. We present models employing backtranslation ideas and sentence encoders trained on supervised tasks recovered from the data. Finally, we present qualitative evaluations of our model based on transferred sample sentences, and quantitative metrics based on pretrained classifier models and paired style data.
1. Introduction

Learning sequence to sequence mappings is a well-studied problem for supervised data. Work on unsupervised sequence to sequence models has approached the problem both from a "style transfer" and an unsupervised machine translation perspective. The meaning of style in the context of natural language is relatively unintuitive when compared to image style. We consider content to be all elements of natural language relevant to its semantic meaning, and style to be all perturbations of natural language that do not affect this semantic meaning. In this sense, style transfer and machine translation are identical problems, because this interpretation of style could be extended to encompass linguistic differences.

Previous work on this problem has mostly modeled the problem as one of learning a single latent vector representing content (for each sentence), and then used separate decoder networks, or a single decoder network with a style indicator feature, to render the sentence in the appropriate style. With this approach, the decoders must capture all the information related to a given style. The problem then becomes one of aligning the latent vectors created by the encoder network from each style. One way to attempt to do this is to use a Variational Autoencoder architecture. This directly aligns the distribution of the latent representation with a fixed prior, such as a multivariate normal distribution, meaning the model can be trained purely using reconstruction loss. Unfortunately, this does not seem to be a strong enough constraint to align the latent distributions well. Other ideas include the use of adversarial training to achieve alignment. Discriminators can be applied both directly to the latent representations to ensure alignment, and to the generated sentences in order to ensure they are distributed similarly to the true sentences of that style. In theory, this latter method aligns the latent representations indirectly. It also introduces the problem of propagating gradients through the text generation process.

As we have noted, there is a strong parallelism between translation and style transfer, and we explored an idea from this literature, that of backtranslation, in the style transfer domain. Backtranslation consists of transferring text from one style to another, then back to the original style, where reconstruction loss can be used. Gradients can only be recovered for the latter half of this process.

We have also explored the idea of using a fixed encoding model trained on a supervised task to generate content representations in the latent space. If the task is closely aligned with the definition of content we have adopted, then the sentence embeddings generated by this model will be aligned without any intervention. Accordingly, we experimented with a skip-thought-like sentence embedding model, which learns sentence representations as inputs to a classification task in which the goal is to choose the sentences adjacent to the target sentence.

A persistent problem in the literature of unsupervised style transfer is the lack of reliable metrics to evaluate models. We look at cross-domain reconstruction loss and qualitatively analyze transferred examples. Following the example of past work, we use a Yelp review dataset, composed of positive and negative reviews.

2. Background

The problem of unsupervised style transfer builds on the rich literature of recurrent neural network architectures. These architectures, such as the LSTM and GRU, have demonstrated incredible success on supervised tasks. They form the encoders and decoders and are thus the base of our model.

While the neural network building blocks of our model have shown success, the data that is available has properties that may conflict with our assumptions or render our problem more difficult. First, we assume that the two content distributions (one for each style dataset) are identical, but this is not necessarily true. Secondly, intuitively, style may be organized at a higher level than the individual sentence, which is not captured in our data. In some texts, especially literary works, style and semantic meaning may be inextricably linked. We can comfortably ignore these relative edge cases, because expecting neural networks to learn such abstract relationships may be far-fetched in any case, but common datasets such as Yelp reviews may not line up with our notions of style and content on a more fundamental level. Is sentiment style, or is it content? While these issues challenge our understanding of the problem and our definitions of content and style, they don't necessarily inhibit the training of our model.

3. Related Work

Unsupervised image-to-image translation models such as CycleGAN (Zhu et al., 2017) provide formidable examples of the power of the adversarial objective to achieve a mapping between unpaired image data. The fundamental innovation of CycleGAN is the use of the "cycle loss" to further constrain the mapping that is learned to be invertible. While the natural language domain poses significant additional challenges because of the discrete nature of text, we nonetheless draw inspiration from the cycle loss objective in our use of "backtranslation" for style transfer.

Most similar to our work in goals is that of Shen et al. (Shen et al., 2017). They evaluate their style transfer model based on sentiment modification performance, as well as on artificial deterministic mappings. The two mappings they choose are word substitution and word scrambling. Besides a baseline VAE, they present two models. The first aligns the latent distributions using an adversarial discriminator on the latent vector. The second aligns the latent distributions using an adversarial discriminator on the generated text. They employ a variety of tricks to create a differentiable text generation process. First, they use the softmax output of the RNN instead of sampling a specific word. Secondly, they use "professor forcing" (Lamb et al., 2016), which consists of running the discriminator network on the hidden states of the RNN. While originally conceived of as a method for regularizing RNNs, this process helps to avoid the problem that one-hot word vectors would be easily distinguishable from softmax ones to a discriminator. Nevertheless, their results do not demonstrate convincingly that their models learn an invertible cross-domain mapping (i.e. one that preserves content).

We draw on the unsupervised machine translation literature (Lample et al., 2017) for inspiration for our first model. Lample et al. combine an adversarial discriminator on the latent code with a cycle-loss-like "backtranslation" process, in which a transferred sentence is transferred back and compared to the original sentence. They do not propagate gradients through the text generation process. Additionally, they use a denoising autoencoder (Vincent et al., 2008), meaning that they perturb inputs with deletions and word shuffles before attempting to encode and reconstruct. This encourages local smoothness in the latent space. They apply the same process to transferred text before performing backtranslation.
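To make this corruption concrete, the following is a minimal sketch of this style of noise, in the spirit of word deletion plus a bounded local shuffle; it is our own illustration, and the exact noise parameters used by Lample et al. may differ.

    import random

    def corrupt(tokens, p_drop=0.1, k=3):
        """Word dropout plus a bounded local shuffle, applied to a token list."""
        # Randomly delete words, keeping at least one token.
        kept = [t for t in tokens if random.random() > p_drop] or tokens[:1]
        # Each word may move at most roughly k positions: add uniform noise
        # to its index and re-sort by the noisy keys.
        keys = [i + random.uniform(0, k) for i in range(len(kept))]
        return [t for _, t in sorted(zip(keys, kept), key=lambda pair: pair[0])]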
Finally, we make use of work on latent sentence representations (Logeswaran & Lee, 2018). Logeswaran et al. train their embedding model with a classification task in which a classification network must choose which of several candidate sentences are neighbors in the original text to a given sentence. This process is very fast compared to other neural network based sentence encoding approaches, which is part of its appeal. Learning good latent representations of sentences that exclusively capture semantic meaning is closely related to style transfer because it would provide an invariant definition of content. This would remove much of the underconstrained nature of the problem, making it much easier.

4. Model

In unsupervised style transfer, we have two sets of sentences X1 and X2. For each example sentence, x1 ∈ X1 and x2 ∈ X2, we consider that it is a deterministic function of a content random variable z sampled from p(z), a distribution over content. This content distribution p(z) is assumed to be the same for both sets of sentences. The deterministic invertible function is unique to each set of sentences. This transformation represents the stylistic qualities of the sentence, which in addition to the semantic meaning suffice to render it. So we have:

F1 : X1 → Z
F2 : X2 → Z

and:

G1 : Z → X1
G2 : Z → X2

where G(F(x)) = x. The task of the model is to learn all four of the above functions, to be able to encode from either style into the latent space and decode from the latent space to either of the styles.

We model these functions using recurrent neural networks:

h_t ← RNN(e_s(x_t), h_{t-1})

The final hidden state of the encoder constitutes the latent representation. Each word of the input sentence is embedded before being fed into the RNN, and the embedding is different for each style s, which provides a signal to the encoder about the origin style. We use an LSTM as the RNN. The decoder is likewise an LSTM initialized with the latent code as the first hidden state. In this case the hidden states are mapped to one-hot word vectors when generating the actual tokens.

This basic setup is shared by all of our models, but the training processes used to induce the properties we desire vary.
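For concreteness, a minimal PyTorch sketch of this encoder follows; the layer sizes, single direction, and module interface are illustrative choices rather than an exact specification of our configuration.

    import torch
    import torch.nn as nn

    class StyleEncoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=300, hidden_dim=300, n_styles=2):
            super().__init__()
            # One embedding table per style; the choice of table signals the origin style.
            self.embeddings = nn.ModuleList(
                [nn.Embedding(vocab_size, emb_dim) for _ in range(n_styles)])
            self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

        def forward(self, tokens, style):
            # tokens: (batch, seq_len) word indices; style: integer style id.
            embedded = self.embeddings[style](tokens)
            _, (h_n, _) = self.rnn(embedded)
            return h_n[-1]  # final hidden state serves as the latent representation

The decoder mirrors this structure: an LSTM whose initial hidden state is the latent code and whose outputs are projected to vocabulary logits.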
The natural extension of this model of unsupervised style transfer is the introduction of latent style representations. In this model, the functions would be parameterized by the latent style vector, instead of training two separate decoders or one decoder with discrete style indicator features. If the set of style codes is S:

F : X1 ∪ X2 → ⟨S, Z⟩
G : ⟨S, Z⟩ → X1 ∪ X2

with the same inverse relationship holding. This is a much more difficult problem because it is even more underconstrained. Unfortunately, we were not able to attempt to develop models for this setup because of a lack of time, but it may have been possible to do so with a fixed pretrained sentence embedding producing the content code.

5. Training

5.1. VAE

We implemented a Variational Autoencoder model as a simple baseline. We used a multivariate normal distribution as the prior. Using the reparameterization trick, the latent representation is that produced by the encoder with Gaussian noise added in. Following the VAE objective, the loss function combines a reconstruction loss and a KL-divergence term which forces the latent distribution to be close to that of the prior. The reconstruction loss is found using "teacher forcing", in which the correct sequence of tokens up to a given point is used in predicting the next token. This process avoids any non-differentiable steps, allowing gradients to be propagated backwards through the network.
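Concretely, one training step of this baseline can be sketched as follows; the encoder and decoder interfaces, and the use of the unshifted sentence as both teacher-forcing input and target, are simplifications made for illustration.

    import torch
    import torch.nn.functional as F

    def vae_step(encoder, decoder, tokens, style):
        # Encoder returns the mean and log-variance of the approximate posterior.
        mu, log_var = encoder(tokens, style)
        std = torch.exp(0.5 * log_var)
        z = mu + std * torch.randn_like(std)              # reparameterization trick
        # Teacher forcing: the decoder conditions on the gold prefix at every step
        # (in practice the targets would be shifted by one position).
        logits = decoder(z, tokens, style)                # (batch, seq_len, vocab)
        recon = F.cross_entropy(logits.transpose(1, 2), tokens)
        # KL divergence between the diagonal Gaussian posterior and the N(0, I) prior.
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()
        return recon + kl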
5.2. Backtranslation

Instead of using KL-divergence to align latent distributions, our backtranslation-based model uses a discriminator over the latent space. This discriminator must classify latent codes as corresponding to an encoding from one style or the other. This gives us the objective:

L_adv(θ_E, θ_D) = E_{x1∼X1}[−log(D(E(x1, 1)))] + E_{x2∼X2}[−log(1 − D(E(x2, 2)))]

where E and D represent our encoder and discriminator, respectively. This adversarial training can be interpreted game-theoretically as a minimax game.
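In code, the discriminator's side of this objective looks roughly as follows; this is a simplified sketch under our notation (with 0-based style ids), and the encoder's adversarial update uses the same scores with the style labels flipped.

    import torch

    def discriminator_loss(discriminator, encoder, x1, x2):
        # Latent codes from each style; detached so only the discriminator is updated here.
        z1 = encoder(x1, style=0).detach()
        z2 = encoder(x2, style=1).detach()
        p1 = discriminator(z1)                  # predicted probability of "style 1"
        p2 = discriminator(z2)
        eps = 1e-8                              # numerical stability
        return -(torch.log(p1 + eps).mean() + torch.log(1.0 - p2 + eps).mean())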
The backtranslation model also employs a reconstruction loss, similar to the VAE model. The key difference is the corruption of inputs as described in the related work section. Because natural language is so sparse, it is possible to encode every input sentence in a different place in the latent space, with spatial relationships that don't correspond to sentence similarity of the inputs. The noise helps to correct this problem by making sure similar sentences are close to each other in latent space, improving the generalizability of the model.

The backtranslation model also takes advantage of a cross-domain loss by transferring from one style to another and then back. This can be viewed as a nearly analogous training procedure to the original reconstruction loss, because the gradients are only passed through the second transfer.
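A sketch of one such cross-domain step follows; greedy generation, the batched corruption helper, and the module interfaces are all assumptions made for illustration. The first transfer is generated without gradients, and only the second transfer receives a gradient from the reconstruction loss.

    import torch
    import torch.nn.functional as F

    def backtranslation_step(encoder, decoder, x1):
        # Style 1 -> style 2: greedy generation, with no gradient through the discrete text.
        with torch.no_grad():
            x2_hat = decoder.generate(encoder(x1, style=0), style=1)
        # Corrupt the transferred sentence, transfer it back, and score the
        # reconstruction of the original sentence under teacher forcing.
        # corrupt_batch: hypothetical batched version of the corruption sketched earlier.
        z_back = encoder(corrupt_batch(x2_hat), style=1)
        logits = decoder(z_back, x1, style=0)             # (batch, seq_len, vocab)
        return F.cross_entropy(logits.transpose(1, 2), x1)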
For baseline calculations, we implemented this model roughly following the parameter choices in the original paper. We used a 2-layer, bidirectional LSTM with a hidden state size of 300 as our encoder. Our decoder has 2 layers, and sentences were generated greedily. The LSTM layers were shared between the source and target encoder, as well as between the source and target decoder. Only the embeddings were changed by the specification of language as input to the encoder and decoder. The encoder and decoder are trained using Adam, with a learning rate of 0.0005 and a mini-batch size of 64. Our discriminator is a multilayer perceptron with 3 hidden layers of size 1024, ReLU activation functions, and a sigmoid output unit. It was trained using Adam with a learning rate of 0.0005. The encoder-decoder and the discriminator were trained by evenly alternating gradient update steps.

5.3. Sentence Embedding Model

We were originally unconvinced, however, that the backtranslation model would be completely effective for style transfer. For example, its adversarial training components as well as its seq2seq framework make training difficult. We argue that a successful model requires an encoder that can provide rich latent representations of sentences, in order for decoders to properly incorporate stylistic elements with minimal effort for recovering content. Therefore, we wanted to implement an encoder framework inspired by skip-thought vector models for training.

The classification network we use to choose the correct neighboring sentences given a target sentence is simply a cosine similarity measure, which leaves all of the responsibility for proper sentence embedding to the encoder.
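A sketch of the resulting objective follows; the batch construction, candidate set, and encoder interface are illustrative assumptions. Sentence vectors are compared by cosine similarity only, and a softmax over the candidates is trained to select the true neighbor.

    import torch
    import torch.nn.functional as F

    def neighbor_loss(encoder, target, candidates, neighbor_index):
        # target: (batch, seq_len); candidates: (batch, n_candidates, seq_len);
        # neighbor_index: (batch,) position of the true adjacent sentence.
        t = F.normalize(encoder(target), dim=-1)                    # (batch, d)
        c = F.normalize(encoder(candidates.flatten(0, 1)), dim=-1)
        c = c.view(candidates.size(0), candidates.size(1), -1)      # (batch, n, d)
        scores = torch.einsum('bd,bnd->bn', t, c)                   # cosine similarities
        return F.cross_entropy(scores, neighbor_index)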

5.4. REINFORCE

We initially planned to use reinforcement learning as a way to train our models directly from adversarial discriminators on transferred text. Although previous work has rejected the use of reinforcement learning in this context because of its notoriously high variance, we hoped that a sufficiently strong model trained using another method would be able to benefit from further training using REINFORCE. We built the infrastructure for this algorithm, but were never able to produce a model strong enough to reach this stage.
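For reference, the update we had prepared looks roughly like this; it is a simplified sketch, and the sampling interface and the use of the discriminator score as the reward are our own illustrative choices.

    import torch

    def reinforce_loss(encoder, decoder, discriminator, x1):
        z = encoder(x1, style=0)
        # Sample a transferred sentence and keep the per-token log-probabilities.
        tokens, log_probs = decoder.sample(z, style=1)    # log_probs: (batch, seq_len)
        with torch.no_grad():
            reward = discriminator(tokens)                # how "style 2" the sample looks
            reward = reward - reward.mean()               # mean baseline to reduce variance
        # REINFORCE: weight the log-likelihood of the sampled tokens by the reward.
        return -(reward.unsqueeze(1) * log_probs).sum(dim=1).mean()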

6. Methods

The models were implemented using PyTorch.

For the sentiment modification task we use the Yelp reviews dataset (yel). Following Shen et al., we consider ratings above 3 to be attached to positive reviews and anything lower to be attached to negative reviews, and we consider every sentence to carry the sentiment of its overall review. This labeling is noisy; we mitigate the problem by dropping longer sentences and longer reviews overall, which Shen et al. argue correlate with increased use of sentiment-neutral background sentences. The final dataset has 250K negative sentences and 350K positive ones.
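This preprocessing amounts to only a few lines; the field names and length cutoffs below are illustrative assumptions, and 3-star reviews are skipped here since the text leaves their handling implicit.

    def label_sentences(reviews, max_review_sents=10, max_sentence_words=15):
        """reviews: iterable of dicts with a star rating and a list of sentences (assumed schema)."""
        examples = []
        for review in reviews:
            if review['stars'] == 3 or len(review['sentences']) > max_review_sents:
                continue                                   # drop neutral ratings and long reviews
            label = 'positive' if review['stars'] > 3 else 'negative'
            for sentence in review['sentences']:
                if len(sentence.split()) <= max_sentence_words:
                    examples.append((sentence, label))     # every sentence inherits the review label
        return examples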
When training our sentence embedding model, we used the Children's Book Test dataset (Hill et al., 2015) and the "All the news" Kaggle dataset (new). Combined, these datasets have approximately a million sentences. Unfortunately, we were unable to use this trained encoder for our experiments, because we ran out of time.

We used cross entropy loss to measure reconstruction loss. A major issue with style transfer is the lack of clear metrics: for unsupervised datasets there are no labels to compare with directly, and pre-trained classifiers may make mistakes in ways similar to generators trained adversarially against classifiers. Instead, we use supervised data, a collection of Shakespeare sentences and their modern English equivalents. We can then compare transferred examples to the actual equivalents as created by a human. We were able to train our sentence embedding model on these Shakespeare sentences, with the goal of seeing whether a simple alternative to normal encoder-decoder training could actually perform equally well or even better.

Table 1. Sentiment transfer on Yelp reviews. Each generated sentence is a transfer from a positive sentence to a negative one.

Source:                 i would absolutely stay there again !
VAE:                    we then think it 's little horrible - now i can have had unk !
Backtranslation model:  i will never go back again !

Source:                 nice place to eat pizza and great service .
VAE:                    don't waste this way to sit and get their good food and went to say .
Backtranslation model:  a little annoying with the pizza , not good .

Source:                 nice location and shop is professional looking .
VAE:                    and the service i have had to mad and the reviews is going along .
Backtranslation model:  a little slow , and the wait was empty .

Table 2. Style transfer between writing by Shakespeare and modern English.

Source (Shakespeare):        who was t came by ?
Backtranslation model:       tell tell tell tell tell with me .
Backtranslation + SE model:  who came by ?
Sparknotes original:         who was it that came here ?

Source (Shakespeare):        the sky doth frown and unk upon our army .
Backtranslation model:       dont take me to see me , and i ll have you .
Backtranslation + SE model:  sad days , they are here .
Sparknotes original:         the sky frowns and scowls on our army .

Source (Shakespeare):        all three now marry in an instant .
Backtranslation model:       come on , when they re traitor .
Backtranslation + SE model:  three now come marry
Sparknotes original:         All three of us will marry now in death .

7. Results

The results in Table 1 show our two baseline models on the sentiment transfer task, as a proof of concept for how well machine translation approaches carry over to this setting. Indeed, replicating past reports, we can see that the VAE model did not perform very well: it was unable to capture the content of the sentence and also struggled to generate complete sentences.

The backtranslation model did noticeably better, as it was able to preserve the content and correctly change key words to transfer sentiment. However, the last sentence is an example of where the model did not do a very good job.

In the Shakespeare task in Table 2, we can see that our pretrained backtranslation model with frozen encoder weights was most effective, although it still struggled with preserving content.

8. Discussion

None of our models produced satisfactory performance on this task. Perhaps informatively, though, each failed in slightly different ways. The naive Variational Autoencoder model, on the sentiment modification task, frequently produced transferred sentences that reflected the target style but completely abandoned any semantic resemblance to the original sentence. This hints that the VAE architecture is not capable of aligning the two latent distributions. This could reflect something relatively simple, such as the latent representation of one style being a rotation of the latent representation of another, or it may indicate some deeper failure mode. The VAE learns to perfectly reconstruct input sentences that are decoded from the latent space back to the same style.

The backtranslation model produced qualitatively better results, but suffered in general from a similar inability to preserve semantic meaning.

We argue that sentiment transfer is a very simplified form of general style transfer, and in many ways cannot be seen as style transfer in the literary sense at all. Therefore, we speculated that the backtranslation model, although it performed acceptably in sentiment transfer, would not be able to capture other forms of style transfer effectively.

9. Conclusion

Ultimately, our models failed to produce meaningful results. As noted in our discussion section, none of our models truly succeeded in preserving semantic meaning across style transfer. There are several potential reasons for this. First, it is possible that our models were trained for less time than their machine translation inspirations. It is also possible that some of our assumptions, such as a shared content distribution, were violated by the data we used, subverting our goals.

Whatever the case, our models were not capable of aligning the latent representations. The first continuation of our work would be further experimentation with and training of the skip-thought sentence encoding model. Such a model might completely sidestep the latent space alignment problem. The logical next step from models using discrete style features is continuous representations of style. With a fixed content encoder, one could use an adversarial discriminator to disentangle the style representation.

Unsupervised style transfer for natural language remains an open problem, with everything from the alignment problem to the absence of accepted metrics lacking clear solutions.

References

All the news. URL https://www.kaggle.com/snapcrack/all-the-news.

Yelp dataset. URL https://www.kaggle.com/yelp-dataset/yelp-dataset.

Hill, Felix, Bordes, Antoine, Chopra, Sumit, and Weston, Jason. The goldilocks principle: Reading children's books with explicit memory representations. arXiv preprint arXiv:1511.02301, 2015.

Lamb, Alex M., Goyal, Anirudh, Zhang, Ying, Zhang, Saizheng, Courville, Aaron C., and Bengio, Yoshua. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pp. 4601–4609, 2016.

Lample, Guillaume, Denoyer, Ludovic, and Ranzato, Marc'Aurelio. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.

Logeswaran, Lajanugen and Lee, Honglak. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893, 2018.

Shen, Tianxiao, Lei, Tao, Barzilay, Regina, and Jaakkola, Tommi. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pp. 6833–6844, 2017.

Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM, 2008.

Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, and Efros, Alexei A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.