Ask Me Anything: Dynamic Memory Networks for Natural Language Processing

Ankit Kumar, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, Richard Socher
firstname@metamind.io, MetaMind, Palo Alto, CA, USA

arXiv:1506.07285v5 [cs.CL], 5 Mar 2016
Figure 3. Real example of an input list of sentences and the attention gates that are triggered by a specific question from the bAbI tasks (Weston et al., 2015a). Gate values $g^i_t$ are shown above the corresponding vectors. The gates change with each search over inputs. We do not draw connections for gates that are close to zero. Note that the second iteration has wrongly placed some weight in sentence 2, which makes some intuitive sense, as sentence 2 is another place John had been.
Given a question of $T_Q$ words, hidden states for the question encoder at time $t$ are given by $q_t = \text{GRU}(L[w_t^Q], q_{t-1})$, where $L$ represents the word embedding matrix as in the previous section and $w_t^Q$ represents the word index of the $t$th word in the question. We share the word embedding matrix across the input module and the question module. Unlike the input module, the question module produces as output the final hidden state of the recurrent network encoder: $q = q_{T_Q}$.
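As a minimal illustration of this encoder, the PyTorch sketch below embeds the question tokens with the shared embedding matrix $L$ and runs a GRU, returning the final hidden state as $q$. The class name and sizes are our own choices for the example, not the authors' implementation.

```python
import torch
import torch.nn as nn

class QuestionModule(nn.Module):
    """Encodes a question into a single vector q = q_{T_Q} (final GRU state)."""

    def __init__(self, embedding: nn.Embedding, hidden_size: int):
        super().__init__()
        self.embedding = embedding  # embedding matrix L, shared with the input module
        self.gru = nn.GRU(embedding.embedding_dim, hidden_size, batch_first=True)

    def forward(self, word_indices: torch.Tensor) -> torch.Tensor:
        # word_indices: (batch, T_Q) integer ids w_t^Q
        embedded = self.embedding(word_indices)  # L[w_t^Q]: (batch, T_Q, d)
        _, q = self.gru(embedded)                # q: (1, batch, hidden)
        return q.squeeze(0)                      # final hidden state q_{T_Q}
```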
2.3. Episodic Memory Module

The episodic memory module iterates over representations outputted by the input module, while updating its internal episodic memory. In its general form, the episodic memory module is comprised of an attention mechanism as well as a recurrent network with which it updates its memory. During each iteration, the attention mechanism attends over the fact representations $c$ while taking into consideration the question representation $q$ and the previous memory $m^{i-1}$ to produce an episode $e^i$.

The episode is then used, alongside the previous memory $m^{i-1}$, to update the episodic memory $m^i = \text{GRU}(e^i, m^{i-1})$. The initial state of this GRU is initialized to the question vector itself: $m^0 = q$. For some tasks, it is beneficial for the episodic memory module to take multiple passes over the input. After $T_M$ passes, the final memory $m^{T_M}$ is given to the answer module.

Need for Multiple Episodes: The iterative nature of this module allows it to attend to different inputs during each pass. It also allows for a type of transitive inference, since the first pass may uncover the need to retrieve additional facts. For instance, in the example in Fig. 3, we are asked Where is the football? In the first iteration, the model ought to attend to sentence 7 (John put down the football.), as the question asks about the football. Only once the model sees that John is relevant can it reason that the second iteration should retrieve where John was. Similarly, a second pass may help for sentiment analysis, as we show in the experiments section below.

Attention Mechanism: In our work, we use a gating function as our attention mechanism. For each pass $i$, the mechanism takes as input a candidate fact $c_t$, a previous memory $m^{i-1}$, and the question $q$ to compute a gate: $g^i_t = G(c_t, m^{i-1}, q)$.

The scoring function $G$ takes as input the feature set $z(c, m, q)$ and produces a scalar score. We first define a large feature vector that captures a variety of similarities between input, memory and question vectors:

$$z(c, m, q) = \left[ c, m, q, c \circ q, c \circ m, |c - q|, |c - m|, c^T W^{(b)} q, c^T W^{(b)} m \right], \quad (5)$$

where $\circ$ is the element-wise product. The function $G$ is a simple two-layer feed forward neural network.
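To make the gating mechanism concrete, here is a minimal PyTorch sketch of the scoring function $G$ over the feature set $z(c, m, q)$ from Eq. (5), together with the per-pass memory update $m^i = \text{GRU}(e^i, m^{i-1})$. The gated episode recurrence $h_t = g_t\,\text{GRU}(c_t, h_{t-1}) + (1 - g_t)\,h_{t-1}$ is reconstructed from the published version of the paper (the excerpt truncates before it appears), and all class names, sizes, and the number of passes are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Two-layer feed-forward scoring network G over z(c, m, q) from Eq. (5)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # W^(b) for the bilinear similarity terms c^T W^(b) q and c^T W^(b) m.
        self.Wb = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.01)
        # z concatenates seven hidden_size vectors plus two scalar similarities.
        self.layer1 = nn.Linear(7 * hidden_size + 2, hidden_size)
        self.layer2 = nn.Linear(hidden_size, 1)

    def forward(self, c, m, q):
        # Scalar bilinear similarities, one per batch element.
        cWq = ((c @ self.Wb) * q).sum(dim=-1, keepdim=True)
        cWm = ((c @ self.Wb) * m).sum(dim=-1, keepdim=True)
        z = torch.cat(
            [c, m, q, c * q, c * m, (c - q).abs(), (c - m).abs(), cWq, cWm],
            dim=-1,
        )
        # Gate value g in (0, 1).
        return torch.sigmoid(self.layer2(torch.tanh(self.layer1(z))))


class EpisodicMemory(nn.Module):
    """Takes T_M passes over the facts; m^0 = q and m^i = GRU(e^i, m^{i-1})."""

    def __init__(self, hidden_size: int, num_passes: int = 2):
        super().__init__()
        self.gate = AttentionGate(hidden_size)
        self.episode_gru = nn.GRUCell(hidden_size, hidden_size)  # computes e^i
        self.memory_gru = nn.GRUCell(hidden_size, hidden_size)   # updates m^i
        self.num_passes = num_passes

    def forward(self, facts, q):
        # facts: (T_C, batch, hidden) fact representations c_t; q: (batch, hidden).
        m = q                                   # m^0 = q
        for _ in range(self.num_passes):
            h = torch.zeros_like(q)
            for c in facts:                     # attend over facts in order
                g = self.gate(c, m, q)          # g^i_t = G(c_t, m^{i-1}, q)
                # Gated recurrence: the hidden state moves only where g allows.
                h = g * self.episode_gru(c, h) + (1 - g) * h
            m = self.memory_gru(h, m)           # e^i is the final h; update m^i
        return m                                # m^{T_M}, fed to the answer module
```

For bAbI, `facts` would be the sentence-level representations produced by the input module, so each gate $g^i_t$ corresponds to one input sentence as in Fig. 3.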
The work of recent months by Weston et al. on memory networks (Weston et al., 2015b) focuses on adding a memory component for natural language question answering. They have an input (I) and response (R) component, and their generalization (G) and output feature map (O) components have some functional overlap with our episodic memory. However, the Memory Network cannot be applied to the same variety of NLP tasks since it processes sentences independently and not via a sequence model. It requires bag-of-n-gram vector features as well as a separate feature that captures whether a sentence came before another one.

Various other neural memory or attention architectures have recently been proposed for algorithmic problems (Joulin & Mikolov, 2015; Kaiser & Sutskever, 2015), caption generation for images (Malinowski & Fritz, 2014; Chen & Zitnick, 2014), visual question answering (Yang et al., 2015) or other NLP problems and datasets (Hermann et al., 2015).

In contrast, the DMN employs neural sequence models for input representation, attention, and response mechanisms, thereby naturally capturing position and temporality. As a result, the DMN is directly applicable to a broader range of applications without feature engineering. We compare directly to Memory Networks on the bAbI dataset (Weston et al., 2015a).

NLP Applications: The DMN is a general model which we apply to several NLP problems. We compare to what, to the best of our knowledge, is the current state-of-the-art method for each task.

There are many different approaches to question answering: some build large knowledge bases (KBs) with open information extraction systems (Yates et al., 2007), some use neural networks, dependency trees and KBs (Bordes et al., 2012), others only sentences (Iyyer et al., 2014). Many other approaches exist. When QA systems do not produce the right answer, it is often unclear whether it is because they do not have access to the facts, cannot reason over them, or have never seen this type of question or phenomenon. Most QA datasets have only a few hundred questions and answers but require complex reasoning. They hence cannot be solved by models that have to learn purely from examples. While synthetic datasets (Weston et al., 2015a) have problems and can often be solved easily with manual feature engineering, they let us disentangle failure modes of models and understand necessary QA capabilities. They are useful for analyzing models that attempt to learn everything and do not rely on external features like coreference, POS, parsing, logical rules, etc. The DMN is such a model.

Another related model by Andreas et al. (2016) combines neural and logical reasoning for question answering over knowledge bases and visual question answering.

Sentiment analysis is a very useful classification task, and recently the Stanford Sentiment Treebank (Socher et al., 2013) has become a standard benchmark dataset. Kim (2014) reports the previous state-of-the-art result based on a convolutional neural network that uses multiple word vector representations. The previous best model for part-of-speech tagging on the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993) was Søgaard (2011), who used a semi-supervised nearest-neighbor approach. We also directly compare to paragraph vectors (Le & Mikolov, 2014).

Neuroscience: The episodic memory in humans stores specific experiences in their spatial and temporal context. For instance, it might contain the first memory somebody has of flying a hang glider. Eichenbaum and Cohen have argued that episodic memories represent a form of relationship (i.e., relations between spatial, sensory and temporal information) and that the hippocampus is responsible for general relational learning (Eichenbaum & Cohen, 2004). Interestingly, it also appears that the hippocampus is active during transitive inference (Heckers et al., 2004), and disruption of the hippocampus impairs this ability (Dusek & Eichenbaum, 1997).

The episodic memory module in the DMN is inspired by these findings. It retrieves specific temporal states that are related to or triggered by a question. Furthermore, we found that the GRU in this module was able to do some transitive inference over the simple facts in the bAbI dataset. This module also has similarities to the Temporal Context Model (Howard & Kahana, 2002) and its Bayesian extensions (Socher et al., 2009), which were developed to analyze human behavior in word recall experiments.

4. Experiments

We include experiments on question answering, part-of-speech tagging, and sentiment analysis. The model is trained independently for each problem, while the architecture remains the same except for the answer module and input fact subsampling (words vs. sentences). The answer module, as described in Section 2.4, is triggered either once at the end or for each token.

For all datasets we used the official train, development, and test splits, or, if no development set was defined, we used 10% of the training set for development. Hyperparameter tuning and model selection (with early stopping) is done on the development set. The DMN is trained via backpropagation and Adam (Kingma & Ba, 2014). We employ L2 regularization and dropout on the word embeddings. Word vectors are pre-trained using GloVe (Pennington et al., 2014).
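The following is a rough sketch, in PyTorch, of the training recipe just described: GloVe-initialized embeddings with dropout, Adam with L2 regularization (expressed as weight decay), and a 10% development split when no official one exists. All sizes and hyperparameter values below are illustrative placeholders; the paper does not report them in this passage.

```python
import torch
import torch.nn as nn

# Stand-ins for real data: actual GloVe vectors and training examples are
# assumed to be loaded elsewhere. Sizes and values here are placeholders.
vocab_size, embed_dim = 10_000, 50
glove_vectors = torch.randn(vocab_size, embed_dim)   # would be real GloVe

# Word vectors pre-trained with GloVe, fine-tuned during training,
# with dropout applied to the word embeddings.
embedding = nn.Embedding.from_pretrained(glove_vectors, freeze=False)
embed_dropout = nn.Dropout(p=0.1)

# Adam optimizer; L2 regularization via weight decay (strength is a guess).
optimizer = torch.optim.Adam(embedding.parameters(), lr=1e-3, weight_decay=1e-5)

# If a dataset defines no development split, hold out 10% of training data
# for hyperparameter tuning and early stopping.
train_examples = list(range(1_000))                  # stand-in training set
n_dev = len(train_examples) // 10
dev_examples, train_examples = train_examples[:n_dev], train_examples[n_dev:]
```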
Table 5. An example of what the DMN focuses on during each episode on a real query in the bAbI task. Darker colors mean that the
attention weight is higher.
[Figures 4 and 5: attention heatmaps over sentence tokens; the rendered content is not recoverable from the text extraction. Panel titles, top to bottom: 1-iter DMN (pred: negative, ans: positive); 1-iter DMN (pred: negative, ans: positive); 2-iter DMN (pred: positive, ans: positive); 2-iter DMN (pred: positive, ans: positive); 1-iter DMN (pred: very positive, ans: negative); 1-iter DMN (pred: positive, ans: negative); 2-iter DMN (pred: negative, ans: negative); 2-iter DMN (pred: negative, ans: negative).]
Figure 4. Attention weights for sentiment examples that were only labeled correctly by a DMN with two episodes. The y-axis shows the episode number. This sentence demonstrates a case where the ability to iterate allows the DMN to sharply focus on relevant words.

Figure 5. These sentences demonstrate cases where initially positive words lost their importance after the entire sentence context became clear, either through a contrastive conjunction ("but") or a modified action ("best described as lukewarm").
Allowing the episodic memory module to perform multiple passes over the data is beneficial. It provides significant benefits on harder bAbI tasks, which require reasoning over several pieces of information or transitive reasoning. Increasing the number of passes also slightly improves the performance on sentiment analysis, though the difference is not as significant. We did not attempt more iterations for sentiment analysis as the model struggles with overfitting with three passes.

5. Conclusion

The DMN model is a potentially general architecture for a variety of NLP applications, including classification, question answering and sequence modeling. A single architecture is a first step towards a single joint model for multiple NLP problems. The DMN is trained end-to-end with one, albeit complex, objective function. Future work will explore additional tasks, larger multi-task models and multimodal inputs and questions.
Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014a.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014b.

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.

Dusek, J. A. and Eichenbaum, H. The hippocampus and memory for orderly stimulus relations. Proceedings of the National Academy of Sciences, 94(13):7109–7114, 1997.

Eichenbaum, H. and Cohen, N. J. From Conditioning to Conscious Recollection: Memory Systems of the Brain (Oxford Psychology). Oxford University Press, 1st edition, 2004. ISBN 0195178041.

Elman, J. L. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7(2-3):195–225, 1991.

Graves, A., Wayne, G., and Danihelka, I. Neural Turing machines. CoRR, abs/1410.5401, 2014.

Heckers, S., Zalesak, M., Weiss, A. P., Ditman, T., and Titone, D. Hippocampal activation during transitive inference in humans. Hippocampus, 14:153–62, 2004.

Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. Teaching machines to read and comprehend. In NIPS, 2015.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional neural network for modelling sentences. In ACL, 2014.

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Kim, Y. Convolutional neural networks for sentence classification. In EMNLP, 2014.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

Le, Q. V. and Mikolov, T. Distributed representations of sentences and documents. In ICML, 2014.

Malinowski, M. and Fritz, M. A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS, 2014.

Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), June 1993.

Mikolov, T. and Zweig, G. Context dependent recurrent neural network language model. In SLT, pp. 234–239. IEEE, 2012. ISBN 978-1-4673-5125-6.

Passos, A., Kumar, V., and McCallum, A. Lexicon infused phrase embeddings for named entity resolution. In Conference on Computational Natural Language Learning. Association for Computational Linguistics, June 2014.

Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In EMNLP, 2014.
Yang, Z., He, X., Gao, J., Deng, L., and Smola, A. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274, 2015.

Yates, A., Banko, M., Broadhead, M., Cafarella, M. J., Etzioni, O., and Soderland, S. TextRunner: Open information extraction on the web. In HLT-NAACL (Demonstrations), 2007.