Recurrent neural networks (RNNs) are a natural choice for sequence modeling tasks. Recently, RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) have re-emerged as a popular architecture due to their representational power and effectiveness at capturing long-term dependencies. LSTM networks, which we review in Sec. 2, have been successfully applied to a variety of sequence modeling and prediction tasks, notably machine translation (Bahdanau et al., 2015; Sutskever et al., 2014), speech recognition (Graves et al., 2013), image caption generation (Vinyals et al., 2014), and program execution (Zaremba and Sutskever, 2014).

In this paper, we introduce a generalization of the standard LSTM architecture to tree-structured network topologies and show its superiority for representing sentence meaning over a sequential LSTM. While the standard LSTM composes its hidden state from the input at the current time step and the hidden state of the LSTM unit in the previous time step, the tree-structured LSTM, or Tree-LSTM, composes its state from an input vector and the hidden states of arbitrarily many child units. The standard LSTM can then be considered a special case of the Tree-LSTM where each internal node has exactly one child.

In our evaluations, we demonstrate the empirical strength of Tree-LSTMs as models for representing sentences. We evaluate the Tree-LSTM architecture on two tasks: semantic relatedness prediction on sentence pairs and sentiment classification of sentences drawn from movie reviews. Our experiments show that Tree-LSTMs outperform existing systems and sequential LSTM baselines on both tasks. Implementations of our models and experiments are available at https://github.com/stanfordnlp/treelstm.

2 Long Short-Term Memory Networks

2.1 Overview

Recurrent neural networks (RNNs) are able to process input sequences of arbitrary length via the recursive application of a transition function on a hidden state vector h_t. At each time step t, the hidden state h_t is a function of the input vector x_t that the network receives at time t and its previous hidden state h_{t−1}. For example, the input vector x_t could be a vector representation of the t-th word in a body of text (Elman, 1990; Mikolov, 2012). The hidden state h_t ∈ R^d can be interpreted as a d-dimensional distributed representation of the sequence of tokens observed up to time t.

Commonly, the RNN transition function is an affine transformation followed by a pointwise nonlinearity such as the hyperbolic tangent function:

    h_t = tanh(W x_t + U h_{t−1} + b).

Unfortunately, a problem with RNNs with transition functions of this form is that during training, components of the gradient vector can grow or decay exponentially over long sequences (Hochreiter, 1998; Bengio et al., 1994). This problem with exploding or vanishing gradients makes it difficult for the RNN model to learn long-distance correlations in a sequence.

The LSTM architecture (Hochreiter and Schmidhuber, 1997) addresses this problem of learning long-term dependencies by introducing a memory cell that is able to preserve state over long periods of time. While numerous LSTM variants have been described, here we describe the version used by Zaremba and Sutskever (2014).

We define the LSTM unit at each time step t to be a collection of vectors in R^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t and a hidden state h_t. The entries of the gating vectors i_t, f_t and o_t are in [0, 1]. We refer to d as the memory dimension of the LSTM.

The LSTM transition equations are the following:

    i_t = σ(W^(i) x_t + U^(i) h_{t−1} + b^(i)),        (1)
    f_t = σ(W^(f) x_t + U^(f) h_{t−1} + b^(f)),
    o_t = σ(W^(o) x_t + U^(o) h_{t−1} + b^(o)),
    u_t = tanh(W^(u) x_t + U^(u) h_{t−1} + b^(u)),
    c_t = i_t ⊙ u_t + f_t ⊙ c_{t−1},
    h_t = o_t ⊙ tanh(c_t),

where x_t is the input at the current time step, σ denotes the logistic sigmoid function and ⊙ denotes elementwise multiplication. Intuitively, the forget gate controls the extent to which the previous memory cell is forgotten, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state. The hidden state vector in an LSTM unit is therefore a gated, partial view of the state of the unit's internal memory cell. Since the values of the gating variables vary for each vector element, the model can learn to represent information over multiple time scales.
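To make the transition concrete, the following is a minimal NumPy sketch of a single LSTM step following Eqs. 1. The bundling of the per-gate parameters into dictionaries with keys 'i', 'f', 'o', 'u' is our own convention for the sketch, not part of the released implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One LSTM transition (Eqs. 1). W, U, b are dicts keyed by
    # 'i', 'f', 'o', 'u' holding the per-gate parameter matrices/vectors.
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    u = np.tanh(W['u'] @ x_t + U['u'] @ h_prev + b['u'])   # candidate update
    c = i * u + f * c_prev          # elementwise products (the ⊙ in the text)
    h = o * np.tanh(c)
    return h, c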
2.2 Variants

Two commonly-used variants of the basic LSTM architecture are the Bidirectional LSTM and the Multilayer LSTM (also known as the stacked or deep LSTM).

Bidirectional LSTM. A Bidirectional LSTM (Graves et al., 2013) consists of two LSTMs that are run in parallel: one on the input sequence and the other on the reverse of the input sequence. At each time step, the hidden state of the Bidirectional LSTM is the concatenation of the forward and backward hidden states. This setup allows the hidden state to capture both past and future information.

Multilayer LSTM. In Multilayer LSTM architectures, the hidden state of an LSTM unit in layer ℓ is used as input to the LSTM unit in layer ℓ+1 in the same time step (Graves et al., 2013; Sutskever et al., 2014; Zaremba and Sutskever, 2014). Here, the idea is to let the higher layers capture longer-term dependencies of the input sequence.

These two variants can be combined as a Multilayer Bidirectional LSTM (Graves et al., 2013).
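As a concrete illustration of the Bidirectional variant described above, the sketch below runs the lstm_step function from the Sec. 2.1 sketch over a sequence and over its reverse, then concatenates the per-step hidden states. The parameter bundling and function names are again our own assumptions.

def bidirectional_lstm(xs, params_fwd, params_bwd, d):
    # xs: list of input vectors; params_fwd / params_bwd are dicts
    # {'W': ..., 'U': ..., 'b': ...} matching lstm_step above.
    h_f, c_f = np.zeros(d), np.zeros(d)
    h_b, c_b = np.zeros(d), np.zeros(d)
    fwd, bwd = [], []
    for x in xs:                       # forward pass
        h_f, c_f = lstm_step(x, h_f, c_f, **params_fwd)
        fwd.append(h_f)
    for x in reversed(xs):             # backward pass over the reversed sequence
        h_b, c_b = lstm_step(x, h_b, c_b, **params_bwd)
        bwd.append(h_b)
    bwd.reverse()                      # align backward states with time steps
    # Hidden state at each step is the concatenation of both directions.
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]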
3 Tree-Structured LSTMs

A limitation of the LSTM architectures described in the previous section is that they only allow for strictly sequential information propagation. Here, we propose two natural extensions to the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. Both variants allow for richer network topologies where each LSTM unit is able to incorporate information from multiple child units.

As in standard LSTM units, each Tree-LSTM unit (indexed by j) contains input and output gates i_j and o_j, a memory cell c_j and hidden state h_j. The difference between the standard LSTM unit and Tree-LSTM units is that gating vectors and memory cell updates are dependent on the states of possibly many child units. Additionally, instead of a single forget gate, the Tree-LSTM unit contains one forget gate f_jk for each child k. This allows the Tree-LSTM unit to selectively incorporate information from each child. For example, a Tree-LSTM model can learn to emphasize semantic heads in a semantic relatedness task, or it can learn to preserve the representation of sentiment-rich children for sentiment classification.

Figure 2: Composing the memory cell c_1 and hidden state h_1 of a Tree-LSTM unit with two children (subscripts 2 and 3). Labeled edges correspond to gating by the indicated gating vector, with dependencies omitted for compactness.

As with the standard LSTM, each Tree-LSTM unit takes an input vector x_j. In our applications, each x_j is a vector representation of a word in a sentence. The input word at each node depends on the tree structure used for the network. For instance, in a Tree-LSTM over a dependency tree, each node in the tree takes the vector corresponding to the head word as input, whereas in a Tree-LSTM over a constituency tree, the leaf nodes take the corresponding word vectors as input.

3.1 Child-Sum Tree-LSTMs

Given a tree, let C(j) denote the set of children of node j. The Child-Sum Tree-LSTM transition equations are the following:

    h̃_j = Σ_{k∈C(j)} h_k,                               (2)
    i_j = σ(W^(i) x_j + U^(i) h̃_j + b^(i)),             (3)
    f_jk = σ(W^(f) x_j + U^(f) h_k + b^(f)),             (4)
    o_j = σ(W^(o) x_j + U^(o) h̃_j + b^(o)),             (5)
    u_j = tanh(W^(u) x_j + U^(u) h̃_j + b^(u)),          (6)
    c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_jk ⊙ c_k,             (7)
    h_j = o_j ⊙ tanh(c_j),                               (8)

where in Eq. 4, k ∈ C(j).
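A minimal sketch of the Child-Sum update for a single node, written against Eqs. 2–8, might look as follows; the dictionary-based parameter bundling is our own convention for the sketch.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def child_sum_treelstm_node(x_j, child_h, child_c, W, U, b):
    # child_h, child_c: lists of child hidden states / memory cells.
    # W, U, b: dicts of parameters keyed by 'i', 'f', 'o', 'u'.
    d = b['i'].shape[0]
    h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(d)   # Eq. 2
    i = sigmoid(W['i'] @ x_j + U['i'] @ h_tilde + b['i'])           # Eq. 3
    o = sigmoid(W['o'] @ x_j + U['o'] @ h_tilde + b['o'])           # Eq. 5
    u = np.tanh(W['u'] @ x_j + U['u'] @ h_tilde + b['u'])           # Eq. 6
    # One forget gate per child, each conditioned on that child's h_k (Eq. 4).
    f = [sigmoid(W['f'] @ x_j + U['f'] @ h_k + b['f']) for h_k in child_h]
    c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))      # Eq. 7
    h = o * np.tanh(c)                                              # Eq. 8
    return h, c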
Intuitively, we can interpret each parameter matrix in these equations as encoding correlations between the component vectors of the Tree-LSTM unit, the input x_j, and the hidden states h_k of the unit's children. For example, in a dependency tree application, the model can learn parameters W^(i) such that the components of the input gate i_j have values close to 1 (i.e., "open") when a semantically important content word (such as a verb) is given as input, and values close to 0 (i.e., "closed") when the input is a relatively unimportant word (such as a determiner).

Dependency Tree-LSTMs. Since the Child-Sum Tree-LSTM unit conditions its components on the sum of child hidden states h_k, it is well-suited for trees with high branching factor or whose children are unordered. For example, it is a good choice for dependency trees, where the number of dependents of a head can be highly variable. We refer to a Child-Sum Tree-LSTM applied to a dependency tree as a Dependency Tree-LSTM.

3.2 N-ary Tree-LSTMs

The N-ary Tree-LSTM can be used on tree structures where the branching factor is at most N and where children are ordered, i.e., they can be indexed from 1 to N. For any node j, write the hidden state and memory cell of its kth child as h_jk and c_jk respectively. The N-ary Tree-LSTM transition equations are the following:

    i_j = σ(W^(i) x_j + Σ_{ℓ=1}^{N} U_ℓ^(i) h_jℓ + b^(i)),         (9)
    f_jk = σ(W^(f) x_j + Σ_{ℓ=1}^{N} U_kℓ^(f) h_jℓ + b^(f)),       (10)
    o_j = σ(W^(o) x_j + Σ_{ℓ=1}^{N} U_ℓ^(o) h_jℓ + b^(o)),         (11)
    u_j = tanh(W^(u) x_j + Σ_{ℓ=1}^{N} U_ℓ^(u) h_jℓ + b^(u)),      (12)
    c_j = i_j ⊙ u_j + Σ_{ℓ=1}^{N} f_jℓ ⊙ c_jℓ,                     (13)
    h_j = o_j ⊙ tanh(c_j),                                          (14)

where in Eq. 10, k = 1, 2, ..., N. Note that when the tree is simply a chain, both Eqs. 2–8 and Eqs. 9–14 reduce to the standard LSTM transitions, Eqs. 1.

The introduction of separate parameter matrices for each child k allows the N-ary Tree-LSTM model to learn more fine-grained conditioning on the states of a unit's children than the Child-Sum Tree-LSTM. Consider, for example, a constituency tree application where the left child of a node corresponds to a noun phrase, and the right child to a verb phrase. Suppose that in this case it is advantageous to emphasize the verb phrase in the representation. Then the U_kℓ^(f) parameters can be trained such that the components of f_j1 are close to 0 (i.e., "forget"), while the components of f_j2 are close to 1 (i.e., "preserve").

Forget gate parameterization. In Eq. 10, we define a parameterization of the kth child's forget gate f_jk that contains "off-diagonal" parameter matrices U_kℓ^(f), k ≠ ℓ. This parameterization allows for more flexible control of information propagation from child to parent. For example, this allows the left hidden state in a binary tree to have either an excitatory or inhibitory effect on the forget gate of the right child. However, for large values of N, these additional parameters are impractical and may be tied or fixed to zero.

Constituency Tree-LSTMs. We can naturally apply Binary Tree-LSTM units to binarized constituency trees since left and right child nodes are distinguished. We refer to this application of Binary Tree-LSTMs as a Constituency Tree-LSTM. Note that in Constituency Tree-LSTMs, a node j receives an input vector x_j only if it is a leaf node.
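For illustration, a sketch of the binary (N = 2) case used for constituency trees is given below, following Eqs. 9–14; the representation of the per-child matrices U_ℓ^(·) and U_kℓ^(f) as nested Python lists, and the parameter bundling, are our own conventions.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def binary_treelstm_node(x_j, h_children, c_children, W, U, b):
    # h_children = [h_j1, h_j2] (empty for leaves), likewise c_children.
    # U['i'], U['o'], U['u'] are lists of N matrices (U_l); U['f'] is an
    # N x N grid of matrices, so U['f'][k][l] plays the role of U^(f)_{kl}.
    N = len(h_children)
    i = sigmoid(W['i'] @ x_j + sum(U['i'][l] @ h_children[l] for l in range(N)) + b['i'])
    o = sigmoid(W['o'] @ x_j + sum(U['o'][l] @ h_children[l] for l in range(N)) + b['o'])
    u = np.tanh(W['u'] @ x_j + sum(U['u'][l] @ h_children[l] for l in range(N)) + b['u'])
    # Each child k gets its own forget gate, conditioned on *all* children (Eq. 10).
    f = [sigmoid(W['f'] @ x_j + sum(U['f'][k][l] @ h_children[l] for l in range(N)) + b['f'])
         for k in range(N)]
    c = i * u + sum(f[k] * c_children[k] for k in range(N))   # Eq. 13
    h = o * np.tanh(c)                                         # Eq. 14
    return h, c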
In the remainder of this paper, we focus on the special cases of Dependency Tree-LSTMs and Constituency Tree-LSTMs. These architectures are in fact closely related; since we consider only binarized constituency trees, the parameterizations of the two models are very similar. The key difference is in the application of the compositional parameters: dependent vs. head for Dependency Tree-LSTMs, and left child vs. right child for Constituency Tree-LSTMs.

4 Models

We now describe two specific models that apply the Tree-LSTM architectures described in the previous section.

4.1 Tree-LSTM Classification

In this setting, we wish to predict labels ŷ from a discrete set of classes Y for some subset of nodes in a tree. For example, the label for a node in a
parse tree could correspond to some property of the phrase spanned by that node.

At each node j, we use a softmax classifier to predict the label ŷ_j given the inputs {x}_j observed at nodes in the subtree rooted at j. The classifier takes the hidden state h_j at the node as input:

    p̂_θ(y | {x}_j) = softmax(W^(s) h_j + b^(s)),
    ŷ_j = arg max_y p̂_θ(y | {x}_j).

The cost function is the negative log-likelihood of the true class labels y^(k) at each labeled node:

    J(θ) = −(1/m) Σ_{k=1}^{m} log p̂_θ(y^(k) | {x}^(k)) + (λ/2) ‖θ‖_2^2,

where m is the number of labeled nodes in the training set, the superscript k indicates the kth labeled node, and λ is an L2 regularization hyperparameter.
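A minimal sketch of this classification objective follows (NumPy, hypothetical helper names); the squared L2 norm of the full parameter set is assumed to be computed elsewhere, and class indices are assumed to be 0-based.

import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def node_label_loss(h_nodes, y_nodes, Ws, bs, theta_l2, lam):
    # h_nodes: hidden states h_j of the m labeled nodes; y_nodes: gold class
    # indices; Ws, bs: softmax parameters W^(s), b^(s); theta_l2: ||theta||_2^2.
    m = len(h_nodes)
    nll = 0.0
    for h_j, y_j in zip(h_nodes, y_nodes):
        p_hat = softmax(Ws @ h_j + bs)
        nll -= np.log(p_hat[y_j])          # negative log-likelihood of gold label
    return nll / m + 0.5 * lam * theta_l2  # average NLL plus L2 penalty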
4.2 Semantic Relatedness of Sentence Pairs

…comparison of the signs of the input representations.

We want the expected rating under the predicted distribution p̂_θ given model parameters θ to be close to the gold rating y ∈ [1, K]: ŷ = r^T p̂_θ ≈ y. We therefore define a sparse target distribution p that satisfies y = r^T p:

    p_i = { y − ⌊y⌋,         if i = ⌊y⌋ + 1
          { ⌊y⌋ − y + 1,     if i = ⌊y⌋
          { 0,               otherwise

for 1 ≤ i ≤ K. The cost function is the regularized KL-divergence between p and p̂_θ:

    J(θ) = (1/m) Σ_{k=1}^{m} KL(p^(k) ‖ p̂_θ^(k)) + (λ/2) ‖θ‖_2^2,

where m is the number of training pairs and the superscript k indicates the kth sentence pair.
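A sketch of the target distribution and the regularized KL objective, as reconstructed from the equations above, is given below; the model producing the predicted distributions p̂_θ is not shown here.

import numpy as np

def sparse_target(y, K):
    # Target distribution p with r^T p = y for a gold rating y in [1, K]:
    # mass is split between the two integer ratings adjacent to y.
    p = np.zeros(K)
    floor_y = int(np.floor(y))
    if floor_y == K:                        # y is exactly the maximum rating
        p[K - 1] = 1.0
    else:
        p[floor_y - 1] = floor_y - y + 1    # 1-based index i = floor(y)
        p[floor_y] = y - floor_y            # 1-based index i = floor(y) + 1
    return p

def kl_relatedness_loss(p_hat_list, y_list, K, theta_l2, lam):
    # Regularized KL-divergence cost averaged over m sentence pairs;
    # theta_l2 is the squared L2 norm of the parameters (assumed precomputed).
    m = len(y_list)
    kl = 0.0
    for p_hat, y in zip(p_hat_list, y_list):
        p = sparse_target(y, K)
        mask = p > 0                        # KL summed only over the support of p
        kl += np.sum(p[mask] * np.log(p[mask] / p_hat[mask]))
    return kl / m + 0.5 * lam * theta_l2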
    LSTM Variant             Relatedness            Sentiment
                             d       |θ|            d       |θ|
    Standard                 150     203,400        168     315,840
    Bidirectional            150     203,400        168     315,840
    2-layer                  108     203,472        120     318,720
    Bidirectional 2-layer    108     203,472        120     318,720
    Constituency Tree        142     205,190        150     316,800
    Dependency Tree          150     203,400        168     315,840

Table 1: Memory dimensions d and composition function parameter counts |θ| for each LSTM variant that we evaluate.

neutral sentences are excluded). Standard bina-

    Method                                     Fine-grained   Binary
    RAE (Socher et al., 2013)                  43.2           82.4
    MV-RNN (Socher et al., 2013)               44.4           82.9
    RNTN (Socher et al., 2013)                 45.7           85.4
    DCNN (Blunsom et al., 2014)                48.5           86.8
    Paragraph-Vec (Le and Mikolov, 2014)       48.7           87.8
    CNN-non-static (Kim, 2014)                 48.0           87.2
    CNN-multichannel (Kim, 2014)               47.4           88.1
    DRNN (Irsoy and Cardie, 2014)              49.8           86.6
    LSTM                                       46.4 (1.1)     84.9 (0.6)
    Bidirectional LSTM                         49.1 (1.0)     87.5 (0.5)
    2-layer LSTM                               46.0 (1.3)     86.3 (0.6)
    2-layer Bidirectional LSTM                 48.5 (1.0)     87.2 (1.0)
    Dependency Tree-LSTM                       48.4 (0.4)     85.7 (0.4)
    Constituency Tree-LSTM
      – randomly initialized vectors           43.9 (0.6)     82.0 (0.5)
      – Glove vectors, fixed                   49.7 (0.4)     87.5 (0.8)
      – Glove vectors, tuned                   51.0 (0.5)     88.0 (0.3)

Table 2: Test set accuracies on the sentiment classification task (fine-grained and binary); standard deviations over runs in parentheses where reported.
Method Pearson’s r Spearman’s ρ MSE
Illinois-LH (Lai and Hockenmaier, 2014) 0.7993 0.7538 0.3692
UNAL-NLP (Jimenez et al., 2014) 0.8070 0.7489 0.3550
Meaning Factory (Bjerva et al., 2014) 0.8268 0.7721 0.3224
ECNU (Zhao et al., 2014) 0.8414 – –
Mean vectors 0.7577 (0.0013) 0.6738 (0.0027) 0.4557 (0.0090)
DT-RNN (Socher et al., 2014) 0.7923 (0.0070) 0.7319 (0.0071) 0.3822 (0.0137)
SDT-RNN (Socher et al., 2014) 0.7900 (0.0042) 0.7304 (0.0076) 0.3848 (0.0074)
LSTM 0.8528 (0.0031) 0.7911 (0.0059) 0.2831 (0.0092)
Bidirectional LSTM 0.8567 (0.0028) 0.7966 (0.0053) 0.2736 (0.0063)
2-layer LSTM 0.8515 (0.0066) 0.7896 (0.0088) 0.2838 (0.0150)
2-layer Bidirectional LSTM 0.8558 (0.0014) 0.7965 (0.0018) 0.2762 (0.0020)
Constituency Tree-LSTM 0.8582 (0.0038) 0.7966 (0.0053) 0.2734 (0.0108)
Dependency Tree-LSTM 0.8676 (0.0030) 0.8083 (0.0042) 0.2532 (0.0052)
Table 3: Test set results on the SICK semantic relatedness subtask. For our experiments, we report mean
scores over 5 runs (standard deviations in parentheses). Results are grouped as follows: (1) SemEval
2014 submissions; (2) Our own baselines; (3) Sequential LSTMs; (4) Tree-structured LSTMs.
Figure 3: Fine-grained sentiment classification accuracy vs. sentence length. For each ℓ, we plot accuracy for the test set sentences with length in the window [ℓ − 2, ℓ + 2]. Examples in the tail of the length distribution are batched in the final window (ℓ = 45). Curves shown: DT-LSTM, CT-LSTM, LSTM, Bi-LSTM.

Figure 4: Pearson correlations r between predicted similarities and gold ratings vs. sentence length. For each ℓ, we plot r for the pairs with mean length in the window [ℓ − 2, ℓ + 2]. Examples in the tail of the length distribution are batched in the final window (ℓ = 18.5). Curves shown: DT-LSTM, CT-LSTM, LSTM, Bi-LSTM.
systems without any additional feature engineering, with the best results achieved by the Dependency Tree-LSTM. Recall that in this task, both Tree-LSTM models only receive supervision at the root of the tree, in contrast to the sentiment classification task where supervision was also provided at the intermediate nodes. We conjecture that in this setting, the Dependency Tree-LSTM benefits from its more compact structure relative to the Constituency Tree-LSTM, in the sense that paths from input word vectors to the root of the tree are shorter on aggregate for the Dependency Tree-LSTM.

7 Discussion and Qualitative Analysis

7.1 Modeling Semantic Relatedness

In Table 4, we list nearest-neighbor sentences retrieved from a 1000-sentence sample of the SICK test set. We compare the neighbors ranked by the Dependency Tree-LSTM model against a baseline ranking by cosine similarity of the mean word vectors for each sentence.

The Dependency Tree-LSTM model exhibits several desirable properties. Note that in the dependency parse of the second query sentence, the word "ocean" is the second-furthest word from the root ("waving"), with a depth of 4. Regardless, the retrieved sentences are all semantically related to the word "ocean", which indicates that the Tree-LSTM is able to both preserve and emphasize information from relatively distant nodes. Additionally, the Tree-LSTM model shows greater robustness to differences in sentence length. Given the query "two men are playing guitar", the Tree-LSTM associates the phrase "playing guitar" with the longer, related phrase "dancing and singing in front of a crowd" (note as well that there is zero token overlap between the two phrases).

7.2 Effect of Sentence Length

One hypothesis to explain the empirical strength of Tree-LSTMs is that tree structures help mitigate the problem of preserving state over long sequences of words. If this were true, we would expect to see the greatest improvement over sequential LSTMs on longer sentences. In Figs. 3 and 4, we show the relationship between sentence length and performance as measured by the relevant task-specific metric. Each data point is a mean score over 5 runs, and error bars have been omitted for clarity.

We observe that while the Dependency Tree-LSTM does significantly outperform its sequential counterparts on the relatedness task for longer sentences of length 13 to 15 (Fig. 4), it also achieves consistently strong performance on shorter sentences. This suggests that unlike sequential LSTMs, Tree-LSTMs are able to encode semantically-useful structural information in the sentence representations that they compose.
Query: a woman is slicing potatoes
  Ranking by mean word vector cosine similarity:
    a woman is cutting potatoes (0.96)
    a woman is slicing herbs (0.92)
    a woman is slicing tofu (0.92)
  Ranking by Dependency Tree-LSTM model:
    a woman is cutting potatoes (4.82)
    potatoes are being sliced by a woman (4.70)
    tofu is being sliced by a woman (4.39)

Query: a boy is waving at some young runners from the ocean
  Ranking by mean word vector cosine similarity:
    a man and a boy are standing at the bottom of some stairs, which are outdoors (0.92)
    a group of children in uniforms is standing at a gate and one is kissing the mother (0.90)
    a group of children in uniforms is standing at a gate and there is no one kissing the mother (0.90)
  Ranking by Dependency Tree-LSTM model:
    a group of men is playing with a ball on the beach (3.79)
    a young boy wearing a red swimsuit is jumping out of a blue kiddies pool (3.37)
    the man is tossing a kid into the swimming pool that is near the ocean (3.19)

Query: two men are playing guitar
  Ranking by mean word vector cosine similarity:
    some men are playing rugby (0.88)
    two men are talking (0.87)
    two dogs are playing with each other (0.87)
  Ranking by Dependency Tree-LSTM model:
    the man is singing and playing the guitar (4.08)
    the man is opening the guitar for donations and plays with the case (4.01)
    two men are dancing and singing in front of a crowd (4.00)

Table 4: Most similar sentences from a 1000-sentence sample drawn from the SICK test set. The Tree-LSTM model is able to pick up on more subtle relationships, such as that between "beach" and "ocean" in the second example.
8 Related Work

Distributed representations of words (Rumelhart et al., 1988; Collobert et al., 2011; Turian et al., 2010; Huang et al., 2012; Mikolov et al., 2013; Pennington et al., 2014) have found wide applicability in a variety of NLP tasks. Following this success, there has been substantial interest in the area of learning distributed phrase and sentence representations (Mitchell and Lapata, 2010; Yessenalina and Cardie, 2011; Grefenstette et al., 2013; Mikolov et al., 2013), as well as distributed representations of longer bodies of text such as paragraphs and documents (Srivastava et al., 2013; Le and Mikolov, 2014).

Our approach builds on recursive neural networks (Goller and Kuchler, 1996; Socher et al., 2011), which we abbreviate as Tree-RNNs in order to avoid confusion with recurrent neural networks. Under the Tree-RNN framework, the vector representation associated with each node of a tree is composed as a function of the vectors corresponding to the children of the node. The choice of composition function gives rise to numerous variants of this basic framework. Tree-RNNs have been used to parse images of natural scenes (Socher et al., 2011), compose phrase representations from word vectors (Socher et al., 2012), and classify the sentiment polarity of sentences (Socher et al., 2013).

9 Conclusion

In this paper, we introduced a generalization of LSTMs to tree-structured network topologies. The Tree-LSTM architecture can be applied to trees with arbitrary branching factor. We demonstrated the effectiveness of the Tree-LSTM by applying the architecture in two tasks: semantic relatedness and sentiment classification, outperforming existing systems on both. Controlling for model dimensionality, we demonstrated that Tree-LSTM models are able to outperform their sequential counterparts. Our results suggest further lines of work in characterizing the role of structure in producing distributed representations of sentences.

Acknowledgements

We thank our anonymous reviewers for their valuable feedback. Stanford University gratefully acknowledges the support of a Natural Language Understanding-focused gift from Google Inc. and the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).

Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2).
Bjerva, Johannes, Johan Bos, Rob van der Goot, and Malvina Nissim. 2014. The Meaning Factory: Formal semantics for recognizing textual entailment and determining semantic similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Blunsom, Phil, Edward Grefenstette, Nal Kalchbrenner, et al. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Chen, Danqi and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12:2121–2159.

Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science 14(2):179–211.

Foltz, Peter W, Walter Kintsch, and Thomas K Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes 25(2-3):285–307.

Ganitkevitch, Juri, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of HLT-NAACL 2013.

Goller, Christoph and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks.

Graves, Alex, Navdeep Jaitly, and A-R Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

Grefenstette, Edward, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics.

Hochreiter, Sepp. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(02):107–116.

Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8).

Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL).

Irsoy, Ozan and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems.

Jimenez, Sergio, George Duenas, Julia Baquero, Alexander Gelbukh, Av Juan Dios Bátiz, and Av Mendizábal. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Kim, Yoon. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Klein, Dan and Christopher D Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics.

Lai, Alice and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Landauer, Thomas K and Susan T Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2):211.

Le, Quoc and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).
Marelli, Marco, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Mikolov, Tomáš. 2012. Statistical Language Models Based on Neural Networks. Ph.D. thesis, Brno University of Technology.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Mitchell, Jeff and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science 34(8):1388–1429.

Pennington, Jeffrey, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5.

Socher, Richard, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP).

Socher, Richard, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2.

Socher, Richard, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).

Socher, Richard, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.

Srivastava, Nitish, Ruslan Salakhutdinov, and Geoffrey Hinton. 2013. Modeling documents with a Deep Boltzmann Machine. In Uncertainty in Artificial Intelligence.

Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.

Turian, Joseph, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.

Yessenalina, Ainur and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Zaremba, Wojciech and Ilya Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.

Zhao, Jiang, Tian Tian Zhu, and Man Lan. 2014. ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).