
Sentence Generation From Relation Tuples

Guoxing Li
Stanford University
Stanford, CA 94305
guoxing@stanford.edu

Nathan Eidelson
Stanford University
Stanford, CA 94305
nathanje@stanford.edu

Nishith Khandwala
Stanford University
Stanford, CA 94305
nishith@stanford.edu

Abstract
How can machines communicate with humans in a more natural way? To answer this question, a machine first needs to express its knowledge in a human-friendly format, namely natural language. In this paper, we present a system that is able to generate human-friendly sentences from structured knowledge, specifically relation tuples. The model adopts an unsupervised approach and takes advantage of a powerful bi-directional recurrent neural network language model and GloVe vectors for word representation to generate sentences that convey the same knowledge as the original relation tuples. Our experiments show that the performance of our system is still suboptimal. However, it should serve as an initial attempt at the task of expressing machine knowledge in natural language.

1 Introduction

Large online databases such as Freebase (https://www.freebase.com) consolidate vast amounts of information, crowd-sourced and vetted by active contributors. The data is exposed to external developers through well-documented APIs. However, it is not intended for consumption by the general populace in the same way Wikipedia (https://www.wikipedia.org/) articles are. The Freebase data is stored in complex nested MQL formats that extend several layers deep and link to other objects in the database. Wikipedia articles are easily understood by humans, while Freebase is easily understood by machines. Not only is information ultimately duplicated between these two platforms, but crowd-sourcing efforts are doubled in their maintenance. We hypothesize that human-readable summaries could be generated from the machine-readable data in Freebase, and this is the fundamental challenge which this paper addresses. The question then arises: how might these two platforms of information aggregation be consolidated?
The rest of the paper is structured as follows. In Section 2, we present the motivation of our work. Related work is discussed in Section 3. We present our system in detail in Section 4, which covers relation tuple generation, a greedy search algorithm, and a bi-directional recurrent neural network language model. Experimental results and analysis are presented in Section 5. Section 6 discusses future work.

2 Motivation

The motivations for pursuing this project are best described by highlighting several inspirations. First, we noticed early in our research that information extraction has made great progress in the past decade. Projects such as OpenIE have enabled the rapid parsing of meaning from large corpora. However, relation extraction is a relatively new field of NLU, and almost all research is dedicated to extracting tuples from text. Current research neglects the reverse direction: generating well-formed and human-readable sentences from relation tuples. We believe this to be important because relation tuples, while convenient for data storage, are not the optimal format for human consumption.
Additionally, we were inspired by the myriad of factual data available through Freebase and surprised by the difficulty of accessing it. Even when topics appear on the Freebase website, their most important information is concealed by miscellaneous and obscure data. The database is not designed for exploration in the same way that Wikipedia is, yet it seems to contain much of the same encyclopedic information. We thus had two sources of inspiration: the novel challenge of generating sentences from relation tuples, and the desire to present the data of Freebase in a more digestible format. The amalgamation of these two ideas ultimately motivated our project of sentence generation from Freebase tuples.

3 Related Work

As mentioned in the previous section, we are trying to go from a formal description of factual knowledge to human-readable sentence representations. Surveying the literature, we came to the realization that generating sentences from relation tuples is not a particularly common area of research. The combination of these two disparate yet powerful technologies, Information Extraction and Natural Language Generation (NLG), remains to be explored.

Here, we discuss some related approaches in these individual fields. One paper (Langkilde and Knight, 1998) examines the synergy between symbolic language processing and bigram frequency statistics (n-grams); the authors make use of bigrams for more accurate sentence generation. Another study (Iyyer et al.) attempted to construct coherent and grammatically correct sentences from a single word vector using dependency parse trees, a type of model which could be useful for enhancing the quality of our output sentences.
Moving on from the traditional methods used for NLG, we also referred to papers that exploit the power of Recurrent Neural Networks (RNNs) for sequence mapping. In particular, one study (Sutskever et al., 2011) demonstrated the ability of RNNs to correctly predict the next character in a sequence by introducing a new multiplicative RNN variant; this modification made RNNs easier to train, contrary to the popular belief at the time. Our faith in the use of RNNs was further strengthened when a new state-of-the-art performance was achieved on the Penn Treebank dataset using an RNN variant (Mikolov and Zweig, 2012). Finally, we referred to a paper (Karpathy and Fei-Fei, 2014) that generates natural language descriptions of different regions of images as well as complete sentence descriptions of full images.

Figure 1: An overview of our overall methodology

4 Approach

4.1 Relation Extraction and Data Generation

First, we had to obtain a set of tuples to use as the starting point of natural language generation. These relation tuples needed to be factually correct, yet sparse enough to allow for flexibility in their syntactic representation. For instance, the tuple (Barack Obama, children, Malia Ann Obama) would successfully define the appropriate relation, while allowing for the paraphrasing of "children" into varying denotations such as "daughter", "child", or "offspring".
We began this project fairly confident that
we would employ OpenIE to extract tuples from
input sentences. This approach had the added
benefit of straightforward evaluation, because the
input sentences could be retained and compared
to the generated sentences. We eventually faced
problems with OpenIE, however, because the
relation tuples were too verbose; in most cases the
arguments concatenated with the relation resulted
in the original sentence. For the sake of pursuing
a more challenging and interesting project we
began researching other options.
We eventually settled on extracting relations from Freebase, an open database of structured MQL knowledge. The advantages are twofold: sparser relations that are still factually accurate, and the noble objective of increasing access to an existing and well-maintained database. One disadvantage is the necessity of scraping Freebase responses to construct well-formed tuples. The Google Freebase API lets us submit a topic, say Barack Obama, as a query and returns a list of relations in a nested, structured format.

Figure 2: The Sentence Generation Model


We convert this structure into a more usable format that will then serve as a primitive generated sentence for our model. For this, we developed a script which: takes a search query, determines the corresponding Freebase topic using the Search API, performs a Topic API request with the topic id, and ultimately parses the response to form tuples. We then remove any tuples that exist in Freebase but would not read well in a sentence, such as image links or timestamp values. In the above case, we get two tuples of the form:

(Barack Obama, children, Natasha Obama)
(Barack Obama, children, Malia Ann Obama)

Looking at these examples, we see that the tuples already assume some kind of sentence structure and part-of-speech tagging. The first and last elements of the tuples are usually noun phrases, and the middle element tends to represent an interaction or relation between the two.
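The response parsing step can be illustrated with a short sketch. The dictionary layout below ("property", "values", "text") only approximates the shape of an old Freebase Topic API response, and the filter list of unwanted property types is an assumption of this sketch rather than part of our actual script:

UNWANTED_SUBSTRINGS = ("image", "timestamp")

def response_to_tuples(subject, topic_response):
    """Turn a (simplified) topic response dict into (subject, relation, object) tuples."""
    tuples = []
    for prop, payload in topic_response.get("property", {}).items():
        # Skip properties that would not read well in a sentence (images, timestamps, ...).
        if any(bad in prop for bad in UNWANTED_SUBSTRINGS):
            continue
        relation = prop.split("/")[-1].replace("_", " ")   # "/people/person/children" -> "children"
        for value in payload.get("values", []):
            if value.get("text"):
                tuples.append((subject, relation, value["text"]))
    return tuples

example_response = {
    "property": {
        "/people/person/children": {
            "values": [{"text": "Natasha Obama"}, {"text": "Malia Ann Obama"}]
        },
        "/common/topic/image": {"values": [{"text": "some-image-id"}]},
    }
}

print(response_to_tuples("Barack Obama", example_response))
# [('Barack Obama', 'children', 'Natasha Obama'), ('Barack Obama', 'children', 'Malia Ann Obama')]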
The next step in our approach is to expand the topic's tuples by querying the Paraphrase Database (PPDB) for each tuple's relation phrase. We append the found paraphrases to the tuple's relation phrase, indicating the various forms that the relation can take. This often results in a significant expansion of the original tuples, from which our sentence generation algorithm can better select the most grammatically correct representation.
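A minimal sketch of this expansion step, assuming the relevant PPDB entries have already been loaded into a plain dictionary (the ppdb mapping and its paraphrase lists below are hand-written illustrations, not actual PPDB output):

# The ppdb dictionary is a stand-in for a loaded PPDB lookup table.
ppdb = {
    "children": ["child", "daughter", "son", "offspring"],
    "board member": ["member of the board", "director"],
}

def expand_relations(tuples, paraphrases):
    """Map each (subj, rel, obj) tuple to (subj, [rel plus paraphrases of rel], obj)."""
    expanded = []
    for subj, rel, obj in tuples:
        relations = [rel] + paraphrases.get(rel, [])
        expanded.append((subj, relations, obj))
    return expanded

print(expand_relations([("Barack Obama", "children", "Malia Ann Obama")], ppdb))
# [('Barack Obama', ['children', 'child', 'daughter', 'son', 'offspring'], 'Malia Ann Obama')]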
We can essentially consider these tuples to
be ill-formed sentences that are not only grammatically incorrect but also lack a general flow.
Our primary task, now, is to take these sentences
as input and generate a coherent sentence conveying the same information, the procedure for which
is detailed below.
To summarize our data processing step, we take a query X, treat it as an input to Freebase, generate relation tuples, and augment the size of this dataset using PPDB for further work:
(X) -> (Freebase format) -> PPDB ->

(X, r_1, A_1) ... (X, r_1, A_m)
(X, r_2, B_1) ... (X, r_2, B_n)
...
(X, r_k, Z_1) ... (X, r_k, Z_p)
where r_i is a list of relation terms augmented by PPDB, k is the number of distinct relations, and (A_1, ..., A_m), ..., (Z_1, ..., Z_p) represent the objects (acted upon) in the tuples obtained from Freebase.
4.2 Sentence Generation

Considering that the relation tuples already contain the major information of a sentence, we decided to fill in words to make that information human readable, i.e. grammatical. One way to address this problem is to directly utilize language modeling. We can think of the task as a sequential generation process with a fixed beginning, middle, and end. Theoretically there could be an arbitrary number of filling words in between. To make the search problem tractable, we limit the filling words to at most two both before and after the relation. This is a reasonable choice since filling words tend to be short, and we don't want to introduce extra information beyond the scope of the relation tuple that the sentence is based on. We then generate sentences for each fixed number of filling words (there are 9 cases in total), and pick the ones with the lowest perplexity among all those sentences.
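As a concrete sketch of this selection step, the driver below enumerates the nine (left, right) fill-count combinations and keeps the lowest-perplexity candidates. The generate_candidates and avg_perplexity callbacks stand in for the greedy search and the language model scoring described next; both are assumptions of this sketch:

from itertools import product

def best_sentences(a1, relation, a2, generate_candidates, avg_perplexity, m=10):
    """Try 0-2 filler words on each side of the relation and rank all candidates."""
    candidates = []
    for lk, rk in product(range(3), repeat=2):      # 3 x 3 = 9 cases in total
        # generate_candidates is assumed to return up to m filled-in token sequences
        # for a fixed number of left (lk) and right (rk) filler slots.
        candidates.extend(generate_candidates(a1, relation, a2, lk, rk, m))
    # Rank every candidate by average per-word perplexity and keep the top m.
    candidates.sort(key=avg_perplexity)
    return candidates[:m]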
Now our task is simplified to generating the sentences with the lowest perplexity for a fixed number of filling words. The task further boils down to a search problem in which the goal is to find the right words to fill the slots so as to yield the lowest perplexity. A brute-force search would be O(N^k), where N is the size of the vocabulary and k is the number of filling words. Considering that k could be up to 4 and N could easily be over 10,000, it is not feasible to run a brute-force approach. We therefore adopt a greedy approach. Suppose we are interested in getting the top m sentences with the lowest perplexity. The algorithm fills the left missing part first, then the right part. When filling the current slot, it treats the sentence as complete except for the current missing word, even though more words are missing. It then uses the language model to generate m candidate words. If there are more empty slots, the algorithm attaches the candidate word to the left or right and continues searching for words for the other slots. Algorithm 1 illustrates the general structure of the greedy algorithm, where a1, r, a2 are argument 1, the relation, and argument 2 respectively, lk is the number of words to be filled between a1 and r, rk is the number of words to be filled between r and a2, and m is the desired number of sentences. The time complexity of this approach is O(mkN).
Furthermore, we notice that some relations in Freebase are not suitable to appear in a sentence as is. Based on the results of our system, we may need to consider replacing them with paraphrases from databases like PPDB. Again, in such cases, we generate a set of candidate sentences using the different relation phrases from PPDB, and pick the ones with the lowest perplexity.
Algorithm 1 Greedy search algorithm to fill words in a fixed number of slots

function GenMiddleCandidates(l, r, k, m)
    cands <- Queue()
    cands.append(([], []))
    while not all cands have k words do
        (lcand, rcand) <- cands.pop()
        words <- GenWord(l + lcand, rcand + r, m)
        for word in words do
            cands.append((lcand + word, rcand))
            cands.append((lcand, word + rcand))
        end for
    end while
    cands <- MergeAll(cands)
    cands <- TopRank(cands, m)
    return cands
end function

function GenSentences(a1, r, a2, lk, rk, m)
    seqs <- List()
    lmissings <- GenMiddleCandidates(a1, r + a2, lk, m)
    for lmissing in lmissings do
        rmissings <- GenMiddleCandidates(a1 + lmissing + r, a2, rk, m)
        for rmissing in rmissings do
            seqs.append(a1 + lmissing + r + rmissing + a2)
        end for
    end for
    seqs <- TopRank(seqs, m)
    return seqs
end function
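For readers who prefer executable code, the following Python sketch mirrors GenMiddleCandidates. The gen_words callback (the language model's top-m predictions for a single missing slot) and the score function (e.g. average perplexity, lower is better) are placeholders for the BRNN model described in the next subsection:

from collections import deque

def gen_middle_candidates(left, right, k, m, gen_words, score):
    """Greedily fill k slots between the token lists `left` and `right`."""
    cands = deque([([], [])])                      # (left fill, right fill) pairs
    while any(len(l) + len(r) < k for l, r in cands):
        lcand, rcand = cands.popleft()
        if len(lcand) + len(rcand) >= k:           # already full; keep it and move on
            cands.append((lcand, rcand))
            continue
        # Treat the sentence as complete except for one slot and ask the LM for m words.
        for word in gen_words(left + lcand, rcand + right, m):
            cands.append((lcand + [word], rcand))  # attach the word on the left ...
            cands.append((lcand, [word] + rcand))  # ... or on the right
    # Merge duplicates and keep the m best fills under the scoring function.
    unique = {tuple(l + r): (l, r) for l, r in cands}
    ranked = sorted(unique.values(), key=lambda lr: score(left + lr[0] + lr[1] + right))
    return ranked[:m]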

4.3 Bi-directional Recurrent Neural Network

A recurrent neural network (RNN) is a type of network where connections between units form a directed cycle. In other words, the hidden layer from the previous timestep is combined with the input layer to compute the hidden layer at the current timestep. RNNs have shown great success in a range of natural language processing tasks, especially in language modeling (Mikolov et al., 2010; Mikolov and Zweig, 2012). However, such RNNs only have information about the past when making a decision at x_t, which is not sufficient for the word-filling task, since the following context (especially the immediately following word) often also determines the word to be filled in. One way to resolve this issue is to use a bi-directional RNN (BRNN) (Schuster and Paliwal, 1997). Our RNN model is based on (Mikolov et al., 2010). We add bi-directionality to it and formulate it as follows:

\overrightarrow{h}_t = \mathrm{sigmoid}(\overrightarrow{L} x_t + \overrightarrow{V} \overrightarrow{h}_{t-1})    (1)

\overleftarrow{h}_t = \mathrm{sigmoid}(\overleftarrow{L} x_t + \overleftarrow{V} \overleftarrow{h}_{t+1})    (2)

y_t = \mathrm{softmax}(\overrightarrow{U} \overrightarrow{h}_t + \overleftarrow{U} \overleftarrow{h}_t)    (3)

where parameters with right arrows belong to the forward pass and those with left arrows to the backward pass, L is the word-representation matrix with each column L_i being the vector for a particular word i, h is the hidden layer, and y_t is the probability distribution over all words in the vocabulary for slot t. Figure 3 illustrates our model.
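A minimal NumPy sketch of equations (1)-(3) may make the data flow clearer. The dimensions and the parameters are placeholders; in the actual system the matrices are learned during training:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def brnn_forward(word_ids, L_f, V_f, U_f, L_b, V_b, U_b):
    """Compute y_t for every slot t of a word-id sequence (equations 1-3).

    L_* hold word vectors in their columns, V_* are the recurrent matrices and
    U_* the output matrices, for the forward (f) and backward (b) directions.
    """
    T, H = len(word_ids), V_f.shape[0]
    h_f, h_b = np.zeros((T, H)), np.zeros((T, H))
    for t in range(T):                             # forward pass, eq. (1)
        prev = h_f[t - 1] if t > 0 else np.zeros(H)
        h_f[t] = sigmoid(L_f[:, word_ids[t]] + V_f @ prev)
    for t in reversed(range(T)):                   # backward pass, eq. (2)
        nxt = h_b[t + 1] if t + 1 < T else np.zeros(H)
        h_b[t] = sigmoid(L_b[:, word_ids[t]] + V_b @ nxt)
    # Distribution over the vocabulary at each slot, eq. (3).
    return np.stack([softmax(U_f @ h_f[t] + U_b @ h_b[t]) for t in range(T)])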
The BRNN is a core part of our sentence generation system. It is able to predict one missing word based on the context around that word, which is an essential building block in the sentence generation process. Note that within our greedy search algorithm the BRNN model does suffer from accuracy issues, since the context provided is incomplete most of the time. However, we think this is a fair trade-off for the large gain in efficiency.


Figure 3: A schematic representation of a bidirectional RNN

4.4 Evaluation Metric

As an unsupervised learning problem, designing an evaluation metric becomes tricky. Had a gold standard been present, the metric for judging our model would essentially boil down to comparing the generated and the reference sentences, word for word or with an n-gram scoring method such as BLEU (Bilingual Evaluation Understudy).

We therefore defined a two-stage evaluation criterion for our project. Keeping in mind that our goal is to generate well-formed sentences, the first step involves verifying that the sentence is grammatically correct. While this test could be performed manually, the Natural Language Toolkit (NLTK, http://www.nltk.org/) helps automate the process. NLTK offers context-free grammars that have been extracted and built from large corpora. If a generated sentence yields a parse under such a grammar, it has valid grammar. However, basing our results on the output of such a syntactic parser alone is not a satisfying evaluation method. As highlighted by our mentor, Dr. Bill MacCartney, NLTK is a predictive model with its own limitations, and so its judgements could be fallible. More importantly, grammatical correctness is a necessary but not a sufficient condition for success. A good output must also possess a logical flow that aligns with that conveyed by the input relation tuple, and it must sound natural. As an example, "Colorless green ideas swim furiously" is grammatically correct, but it is unintelligible.

Extrapolating from here, we introduce the second stage: manual evaluation. Given some randomly chosen relation tuples, we generated the corresponding sentences and manually judged the fidelity and grammatical soundness of the output. If the generated sentence made sense and the grammar was intact within the scope of the relation phrase, we assigned a score of 1 to the relation tuple. If not, the score for the input was 0.
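The first-stage grammar check could look like the sketch below. The toy grammar is only a stand-in: a realistic check needs a much broader CFG, and whether NLTK's bundled grammars cover the generated vocabulary is an open assumption:

import nltk

# Toy CFG used as a placeholder for a broad-coverage grammar.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP | 'larry' 'page' | 'google'
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'board' 'member' | 'member'
V -> 'is'
P -> 'of'
""")

def is_grammatical(sentence, grammar=toy_grammar):
    """Return True if the sentence has at least one parse under the grammar."""
    parser = nltk.ChartParser(grammar)
    tokens = sentence.lower().split()
    try:
        return any(True for _ in parser.parse(tokens))
    except ValueError:
        return False                               # tokens outside the grammar's vocabulary

print(is_grammatical("larry page is the board member of google"))   # True under the toy grammar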

5 Experiments and Results

In this section, we introduce our experiment settings, show the experimental results of our system, and then analyze them.

5.1 Datasets

As mentioned in Section 4.4, our system is mainly evaluated manually. In order to compare the effectiveness of different parts and settings of our system, we limit the test set to 20 relation tuples randomly selected from Freebase, with arguments ranging from places and people to organizations, which is feasible for manual evaluation. The training corpus we chose for our BRNN language model is the Penn Treebank dataset (PTB, https://www.cis.upenn.edu/~treebank/), a database of text derived from the Wall Street Journal. We discard the tree structure and only use the sentence-level text as our training dataset. The dataset contains 56,522 sentences in total.
5.2 Word Vectors

In the original recurrent neural network language model paper (Mikolov et al., 2010), word vectors are initialized randomly and then trained together with the model so that they fit the model better. This is certainly a good approach for achieving better language model performance. However, the model does suffer from long training times: a model with only 8,000 words in the vocabulary takes around 15 hours to train for one epoch of the PTB dataset on a MacBook Pro with a 2.7GHz processor. Considering that most of the relation tuples contain entity names which probably never appear in the PTB dataset, partially because the PTB dataset was generated in the 90s, our model will not perform well, since different entity names, regardless of their types, will all be mapped to the unknown word UUUNKKK. However, we notice that most of the filling words are common words like the prepositions "of" and "to", which means the set of words that needs to be generated is small. We therefore take advantage of GloVe vectors (Pennington et al., 2014) for word representation. In our system, we first pick the 8,000 words that appear most frequently in the PTB dataset as the vocabulary of our RNNLM. We initialize the word vectors with GloVe if the word exists in the GloVe vocabulary, or randomly if it does not. After training the language model, at prediction time we use the trained word vector if a word exists in the model vocabulary, or the word vector from GloVe if it does not. It is worth mentioning that words that appear frequently will be pushed away from their original positions in GloVe, which separates those words from similar words that appear less frequently. However, we believe this error is tolerable. One way to maintain the original word vector distribution from GloVe is to not update word vectors at all during training, i.e. to treat them as constant. Nevertheless, this is less effective in our bi-directional RNNLM (BRNNLM), since there are two word vectors representing each word in the two directions, and a unified word vector representation would degrade the performance of the language model.
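A sketch of the initialization scheme described above, assuming the GloVe vectors have already been loaded into a dictionary mapping words to NumPy arrays; the dimensionality and initialization range are placeholders:

import numpy as np

def init_word_vectors(vocab, glove, dim=50, seed=0):
    """One column per vocabulary word: GloVe if available, small random values otherwise."""
    rng = np.random.RandomState(seed)
    L = np.zeros((dim, len(vocab)))                # columns are word vectors, as in eq. (1)
    for i, word in enumerate(vocab):
        L[:, i] = glove[word] if word in glove else rng.uniform(-0.1, 0.1, size=dim)
    return L

def lookup(word, vocab_index, L, glove):
    """At prediction time: trained vector if in-vocabulary, otherwise fall back to GloVe."""
    if word in vocab_index:
        return L[:, vocab_index[word]]
    return glove.get(word)                         # None if the word is unknown to GloVe too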
5.3 Training

Though our system is unsupervised, we do require a pre-trained language model to plug into our system. The corpus we used to train our BRNNLM is the Penn Treebank dataset, which comes already tokenized. We transform every digit to DG, select the 8,000 words that appear most frequently in the corpus as our vocabulary, and map words outside of the vocabulary to the unknown word UUUNKKK. We use stochastic gradient descent with a learning rate of 0.1 as the optimization method, and we trained the model for one epoch.
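The preprocessing described above amounts to a few lines. The token spellings DG and UUUNKKK follow the text; the tokenized-corpus input format is an assumption of this sketch:

import re
from collections import Counter

VOCAB_SIZE = 8000
UNK = "UUUNKKK"

def preprocess(sentences):
    """sentences: list of token lists. Returns the mapped corpus and the vocabulary."""
    # Replace every digit with DG, as done for the PTB training corpus.
    normalized = [[re.sub(r"\d", "DG", tok) for tok in sent] for sent in sentences]
    counts = Counter(tok for sent in normalized for tok in sent)
    vocab = {w for w, _ in counts.most_common(VOCAB_SIZE)}
    # Map out-of-vocabulary tokens to the unknown word.
    return [[tok if tok in vocab else UNK for tok in sent] for sent in normalized], vocab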

Table 1: Accuracy comparison between the system with PPDB and the system without PPDB

         Without PPDB   With PPDB
Top-1    0.30           0.35
Top-10   0.45           0.65

5.4 Results

Quantitative: Table 1 shows our experimental results. We used two metrics, the top-1 sentence accuracy rate and the top-10 sentence accuracy rate. The top-n sentence accuracy rate is the number of true positives over all test examples; we count an example as a true positive if at least one correct sentence appears in the top n sentences generated by our system, ranked by average perplexity. Looking at the results, we can see that the system with PPDB integrated performs better than the one without PPDB, which validates our hypothesis. However, in general, our system didn't perform as well as we expected. We think there are three main reasons. First, the language model is trained on complete sentences, but during sentence generation we have to force the model to predict missing words based on half-complete sentences in order to limit the search space. Second, the training corpus is both small (56,522 sentences) and not ideal for our task. The corpus is generated from Wall Street Journal articles. Though it is descriptive, we think a better option would be a Wikipedia corpus, since it is a knowledge-based corpus and usually contains the right format for sentences about knowledge. Lastly, we believe the model would perform better with a larger vocabulary, due to the error introduced by training on a small set of words as mentioned in Section 5.2.
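For clarity, the top-n accuracy we report can be written as the small function below, where judgements holds, for each test tuple, the manual 0/1 correctness of its generated sentences ranked by perplexity (the numbers in the example are illustrative only):

def top_n_accuracy(judgements, n):
    """Fraction of test tuples with at least one correct sentence in their top n."""
    hits = sum(1 for scores in judgements if any(scores[:n]))
    return hits / len(judgements)

judgements = [[1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0]]
print(top_n_accuracy(judgements, 1))    # 0.25
print(top_n_accuracy(judgements, 10))   # 0.75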
Qualitative: Table 2 lists the top 10 generated sentences based on the relation tuple (Larry Page, board member, Google). Note that the words "Larry", "Page", and "Google" were not included in our 8,000-word model vocabulary during training. Our system is able to recognize Larry Page and Google as entities and fill in appropriate words. Since "board" is not a frequent word in the corpus and happens to appear together with "big" a lot, our model gives a higher rank to the phrase "big board". We believe this problem can be alleviated by having a larger training corpus.

Table 2: Top 10 sentences generated by our system with PPDB

Sentence                                        Average Loss (per word)
larry page the big board member of google .    3.1549521927
larry page a big board member of google .      3.2161115851
larry page s big board member of google .      3.35368420058
larry page an big board member of google .     3.45527603069
larry page , the board member of google .      3.46242553423
larry page the big board member for google .   3.46760170839
larry page the big board member and google .   3.50093685973
larry page is the board member of google .     3.54004893268
larry page the big board member in google .    3.54992127647
larry page the big board member on google .    3.55370274061

6 Future Work

Relation extraction can be used to further natural language generation: our results indicate that tuples formed by information extraction methods reinforce the fidelity of the generated sentences. In this section, we list a few variations on our methodology that we would have liked to iterate over.

A bigger RNN model trained for longer over a more exhaustive dataset would certainly have helped with tuples containing proper nouns. As of now, our training set does not keep up with current events and hence lacks the names of people and places often mentioned in our input tuples. We believe that a larger training set would resolve this issue.

We also considered making a model learn on the PPDB. Currently, our approach performs a linear search to find relevant paraphrases. To boost performance, it would be interesting to investigate how a neural network trained on the PPDB generates paraphrases in comparison to our existing approach.

Lastly, our current sentence generation model could be modified to further augment the number of input tuples. Given a relation tuple, the order of the three constituents could be permuted. While evaluating our results, we noticed that better sentences would have been generated if the two noun phrases in the tuple had been switched in position. For instance, the tuple (Barack Obama, children, Natasha Obama) will probably not result in a well-formed sentence unless the relation phrase is modified to incorporate the fact that Natasha is the child of Obama and not vice versa. Alternatively, Barack Obama and Natasha Obama could exchange their positions in the tuple to reflect this property of the relation. In summary, our results have the potential to improve further if each tuple leads to six more permutations and the perplexity score is then allowed to rank all the candidate sentences.

7 Conclusion

In this project, we attempted to generate well-formed, logically sound, and grammatically accurate sentences given relation tuples that usually take the form (Noun Phrase 1, Relation Phrase, Noun Phrase 2). Our model comprised a relation extraction engine that collected such tuples and augmented their number, a bi-directional RNN that took these tuples as input and generated words to add before and after the relation phrase, perplexity as a scoring metric to rank the candidate outputs, and a greedy search algorithm to limit the search scope. Upon manual evaluation of our results, our model obtained an accuracy of 0.30 without PPDB and 0.35 with PPDB when considering only the best generated sentence. Evaluating on the top 10 sentences for each input tuple, the accuracies were higher: 0.45 without PPDB and 0.65 with PPDB. We believe our work is an initial attempt at generating human-readable sentences from relation tuples, and there is still a good amount of improvement that can be made on top of it.

Acknowledgements

We would like to thank Professors Bill MacCartney and Chris Potts for being immensely supportive of our project.

References
Mohit Iyyer, Jordan Boyd-Graber, and Hal Daumé III. Generating sentences from semantic vector space representations.

Andrej Karpathy and Li Fei-Fei. 2014. Deep visual-semantic alignments for generating image descriptions. arXiv preprint arXiv:1412.2306.

Irene Langkilde and Kevin Knight. 1998. The practical value of n-grams in generation. In Proceedings of the Ninth International Workshop on Natural Language Generation, pages 248-255.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In SLT, pages 234-239.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045-1048.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1532-1543.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673-2681.

Ilya Sutskever, James Martens, and Geoffrey E. Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017-1024.
