Recurrent neural networks (RNNs) are a natural choice for sequence modeling tasks. Recently, RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) have re-emerged as a popular architecture due to their representational power and effectiveness at capturing long-term dependencies. LSTM networks, which we review in Sec. 2, have been successfully applied to a variety of sequence modeling and prediction tasks, notably machine translation (Bahdanau et al., 2015; Sutskever et al., 2014), speech recognition (Graves et al., 2013), image caption generation (Vinyals et al., 2014), and program execution (Zaremba and Sutskever, 2014).

In this paper, we introduce a generalization of the standard LSTM architecture to tree-structured network topologies and show its superiority for representing sentence meaning over a sequential LSTM. While the standard LSTM composes its hidden state from the input at the current time step and the hidden state of the LSTM unit in the previous time step, the tree-structured LSTM, or Tree-LSTM, composes its state from an input vector and the hidden states of arbitrarily many child units. The standard LSTM can then be considered a special case of the Tree-LSTM where each internal node has exactly one child.

In our evaluations, we demonstrate the empirical strength of Tree-LSTMs as models for representing sentences. We evaluate the Tree-LSTM architecture on two tasks: semantic relatedness prediction on sentence pairs and sentiment classification of sentences drawn from movie reviews. Our experiments show that Tree-LSTMs outperform existing systems and sequential LSTM baselines on both tasks. Implementations of our models and experiments are available at https://github.com/stanfordnlp/treelstm.

2 Long Short-Term Memory Networks

2.1 Overview

Recurrent neural networks (RNNs) are able to process input sequences of arbitrary length via the recursive application of a transition function on a hidden state vector h_t. At each time step t, the hidden state h_t is a function of the input vector x_t that the network receives at time t and its previous hidden state h_{t−1}. For example, the input vector x_t could be a vector representation of the t-th word in a body of text (Elman, 1990; Mikolov, 2012). The hidden state h_t ∈ R^d can be interpreted as a d-dimensional distributed representation of the sequence of tokens observed up to time t.

Commonly, the RNN transition function is an affine transformation followed by a pointwise nonlinearity such as the hyperbolic tangent function:

    h_t = tanh(W x_t + U h_{t−1} + b).

Unfortunately, a problem with RNNs with transition functions of this form is that during training, components of the gradient vector can grow or decay exponentially over long sequences (Hochreiter, 1998; Bengio et al., 1994). This problem with exploding or vanishing gradients makes it difficult for the RNN model to learn long-distance correlations in a sequence.

The LSTM architecture (Hochreiter and Schmidhuber, 1997) addresses this problem of learning long-term dependencies by introducing a memory cell that is able to preserve state over long periods of time. While numerous LSTM variants have been described, here we describe the version used by Zaremba and Sutskever (2014).

We define the LSTM unit at each time step t to be a collection of vectors in R^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t and a hidden state h_t. The entries of the gating vectors i_t, f_t and o_t are in [0, 1]. We refer to d as the memory dimension of the LSTM.

The LSTM transition equations are the following:

    i_t = σ(W^(i) x_t + U^(i) h_{t−1} + b^(i)),        (1)
    f_t = σ(W^(f) x_t + U^(f) h_{t−1} + b^(f)),
    o_t = σ(W^(o) x_t + U^(o) h_{t−1} + b^(o)),
    u_t = tanh(W^(u) x_t + U^(u) h_{t−1} + b^(u)),
    c_t = i_t ⊙ u_t + f_t ⊙ c_{t−1},
    h_t = o_t ⊙ tanh(c_t),

where x_t is the input at the current time step, σ denotes the logistic sigmoid function and ⊙ denotes elementwise multiplication. Intuitively, the forget gate controls the extent to which the previous memory cell is forgotten, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state. The hidden state vector in an LSTM unit is therefore a gated, partial view of the state of the unit's internal memory cell. Since the values of the gating variables vary for each vector element, the model can learn to represent information over multiple time scales.
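To make the transition concrete, the following is a minimal NumPy sketch of a single LSTM step following Eqs. 1. The bundling of the per-gate parameters into dictionaries with keys 'i', 'f', 'o', 'u' is our own convention for the sketch, not part of the released implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One LSTM transition (Eqs. 1). W, U, b are dicts keyed by
    # 'i', 'f', 'o', 'u' holding the per-gate parameter matrices/vectors.
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    u = np.tanh(W['u'] @ x_t + U['u'] @ h_prev + b['u'])   # candidate update
    c = i * u + f * c_prev          # elementwise products (the ⊙ in the text)
    h = o * np.tanh(c)
    return h, c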
2.2 Variants

Two commonly-used variants of the basic LSTM architecture are the Bidirectional LSTM and the Multilayer LSTM (also known as the stacked or deep LSTM).

Bidirectional LSTM. A Bidirectional LSTM (Graves et al., 2013) consists of two LSTMs that are run in parallel: one on the input sequence and the other on the reverse of the input sequence. At each time step, the hidden state of the Bidirectional LSTM is the concatenation of the forward and backward hidden states. This setup allows the hidden state to capture both past and future information.

Multilayer LSTM. In Multilayer LSTM architectures, the hidden state of an LSTM unit in layer ℓ is used as input to the LSTM unit in layer ℓ+1 in the same time step (Graves et al., 2013; Sutskever et al., 2014; Zaremba and Sutskever, 2014). Here, the idea is to let the higher layers capture longer-term dependencies of the input sequence.

These two variants can be combined as a Multilayer Bidirectional LSTM (Graves et al., 2013).
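As a concrete illustration of the Bidirectional variant described above, the sketch below runs the lstm_step function from the Sec. 2.1 sketch over a sequence and over its reverse, then concatenates the per-step hidden states. The parameter bundling and function names are again our own assumptions.

def bidirectional_lstm(xs, params_fwd, params_bwd, d):
    # xs: list of input vectors; params_fwd / params_bwd are dicts
    # {'W': ..., 'U': ..., 'b': ...} matching lstm_step above.
    h_f, c_f = np.zeros(d), np.zeros(d)
    h_b, c_b = np.zeros(d), np.zeros(d)
    fwd, bwd = [], []
    for x in xs:                       # forward pass
        h_f, c_f = lstm_step(x, h_f, c_f, **params_fwd)
        fwd.append(h_f)
    for x in reversed(xs):             # backward pass over the reversed sequence
        h_b, c_b = lstm_step(x, h_b, c_b, **params_bwd)
        bwd.append(h_b)
    bwd.reverse()                      # align backward states with time steps
    # Hidden state at each step is the concatenation of both directions.
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]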
3 Tree-Structured LSTMs

A limitation of the LSTM architectures described in the previous section is that they only allow for strictly sequential information propagation. Here, we propose two natural extensions to the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. Both variants allow for richer network topologies where each LSTM unit is able to incorporate information from multiple child units.

As in standard LSTM units, each Tree-LSTM unit (indexed by j) contains input and output gates i_j and o_j, a memory cell c_j and hidden state h_j. The difference between the standard LSTM unit and Tree-LSTM units is that gating vectors and memory cell updates are dependent on the states of possibly many child units. Additionally, instead of a single forget gate, the Tree-LSTM unit contains one forget gate f_jk for each child k. This allows the Tree-LSTM unit to selectively incorporate information from each child. For example, a Tree-LSTM model can learn to emphasize semantic heads in a semantic relatedness task, or it can learn to preserve the representation of sentiment-rich children for sentiment classification.

Figure 2: Composing the memory cell c_1 and hidden state h_1 of a Tree-LSTM unit with two children (subscripts 2 and 3). Labeled edges correspond to gating by the indicated gating vector, with dependencies omitted for compactness.

As with the standard LSTM, each Tree-LSTM unit takes an input vector x_j. In our applications, each x_j is a vector representation of a word in a sentence. The input word at each node depends on the tree structure used for the network. For instance, in a Tree-LSTM over a dependency tree, each node in the tree takes the vector corresponding to the head word as input, whereas in a Tree-LSTM over a constituency tree, the leaf nodes take the corresponding word vectors as input.

3.1 Child-Sum Tree-LSTMs

Given a tree, let C(j) denote the set of children of node j. The Child-Sum Tree-LSTM transition equations are the following:

    h̃_j = Σ_{k∈C(j)} h_k,                               (2)
    i_j = σ(W^(i) x_j + U^(i) h̃_j + b^(i)),             (3)
    f_jk = σ(W^(f) x_j + U^(f) h_k + b^(f)),             (4)
    o_j = σ(W^(o) x_j + U^(o) h̃_j + b^(o)),             (5)
    u_j = tanh(W^(u) x_j + U^(u) h̃_j + b^(u)),          (6)
    c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_jk ⊙ c_k,             (7)
    h_j = o_j ⊙ tanh(c_j),                               (8)

where in Eq. 4, k ∈ C(j).
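A minimal sketch of the Child-Sum update for a single node, written against Eqs. 2–8, might look as follows; the dictionary-based parameter bundling is our own convention for the sketch.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def child_sum_treelstm_node(x_j, child_h, child_c, W, U, b):
    # child_h, child_c: lists of child hidden states / memory cells.
    # W, U, b: dicts of parameters keyed by 'i', 'f', 'o', 'u'.
    d = b['i'].shape[0]
    h_tilde = np.sum(child_h, axis=0) if child_h else np.zeros(d)   # Eq. 2
    i = sigmoid(W['i'] @ x_j + U['i'] @ h_tilde + b['i'])           # Eq. 3
    o = sigmoid(W['o'] @ x_j + U['o'] @ h_tilde + b['o'])           # Eq. 5
    u = np.tanh(W['u'] @ x_j + U['u'] @ h_tilde + b['u'])           # Eq. 6
    # One forget gate per child, each conditioned on that child's h_k (Eq. 4).
    f = [sigmoid(W['f'] @ x_j + U['f'] @ h_k + b['f']) for h_k in child_h]
    c = i * u + sum(f_k * c_k for f_k, c_k in zip(f, child_c))      # Eq. 7
    h = o * np.tanh(c)                                              # Eq. 8
    return h, c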
Intuitively, we can interpret each parameter matrix in these equations as encoding correlations between the component vectors of the Tree-LSTM unit, the input x_j, and the hidden states h_k of the unit's children. For example, in a dependency tree application, the model can learn parameters W^(i) such that the components of the input gate i_j have values close to 1 (i.e., "open") when a semantically important content word (such as a verb) is given as input, and values close to 0 (i.e., "closed") when the input is a relatively unimportant word (such as a determiner).

Dependency Tree-LSTMs. Since the Child-Sum Tree-LSTM unit conditions its components on the sum of child hidden states h_k, it is well-suited for trees with high branching factor or whose children are unordered. For example, it is a good choice for dependency trees, where the number of dependents of a head can be highly variable. We refer to a Child-Sum Tree-LSTM applied to a dependency tree as a Dependency Tree-LSTM.

3.2 N-ary Tree-LSTMs

The N-ary Tree-LSTM can be used on tree structures where the branching factor is at most N and where children are ordered, i.e., they can be indexed from 1 to N. For any node j, write the hidden state and memory cell of its kth child as h_jk and c_jk respectively. The N-ary Tree-LSTM transition equations are the following:

    i_j = σ(W^(i) x_j + Σ_{ℓ=1}^{N} U_ℓ^(i) h_jℓ + b^(i)),         (9)
    f_jk = σ(W^(f) x_j + Σ_{ℓ=1}^{N} U_kℓ^(f) h_jℓ + b^(f)),       (10)
    o_j = σ(W^(o) x_j + Σ_{ℓ=1}^{N} U_ℓ^(o) h_jℓ + b^(o)),         (11)
    u_j = tanh(W^(u) x_j + Σ_{ℓ=1}^{N} U_ℓ^(u) h_jℓ + b^(u)),      (12)
    c_j = i_j ⊙ u_j + Σ_{ℓ=1}^{N} f_jℓ ⊙ c_jℓ,                     (13)
    h_j = o_j ⊙ tanh(c_j),                                          (14)

where in Eq. 10, k = 1, 2, ..., N. Note that when the tree is simply a chain, both Eqs. 2–8 and Eqs. 9–14 reduce to the standard LSTM transitions, Eqs. 1.

The introduction of separate parameter matrices for each child k allows the N-ary Tree-LSTM model to learn more fine-grained conditioning on the states of a unit's children than the Child-Sum Tree-LSTM. Consider, for example, a constituency tree application where the left child of a node corresponds to a noun phrase, and the right child to a verb phrase. Suppose that in this case it is advantageous to emphasize the verb phrase in the representation. Then the U_kℓ^(f) parameters can be trained such that the components of f_j1 are close to 0 (i.e., "forget"), while the components of f_j2 are close to 1 (i.e., "preserve").

Forget gate parameterization. In Eq. 10, we define a parameterization of the kth child's forget gate f_jk that contains "off-diagonal" parameter matrices U_kℓ^(f), k ≠ ℓ. This parameterization allows for more flexible control of information propagation from child to parent. For example, this allows the left hidden state in a binary tree to have either an excitatory or inhibitory effect on the forget gate of the right child. However, for large values of N, these additional parameters are impractical and may be tied or fixed to zero.

Constituency Tree-LSTMs. We can naturally apply Binary Tree-LSTM units to binarized constituency trees since left and right child nodes are distinguished. We refer to this application of Binary Tree-LSTMs as a Constituency Tree-LSTM. Note that in Constituency Tree-LSTMs, a node j receives an input vector x_j only if it is a leaf node.
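For illustration, a sketch of the binary (N = 2) case used for constituency trees is given below, following Eqs. 9–14; the representation of the per-child matrices U_ℓ^(·) and U_kℓ^(f) as nested Python lists, and the parameter bundling, are our own conventions.

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def binary_treelstm_node(x_j, h_children, c_children, W, U, b):
    # h_children = [h_j1, h_j2] (empty for leaves), likewise c_children.
    # U['i'], U['o'], U['u'] are lists of N matrices (U_l); U['f'] is an
    # N x N grid of matrices, so U['f'][k][l] plays the role of U^(f)_{kl}.
    N = len(h_children)
    i = sigmoid(W['i'] @ x_j + sum(U['i'][l] @ h_children[l] for l in range(N)) + b['i'])
    o = sigmoid(W['o'] @ x_j + sum(U['o'][l] @ h_children[l] for l in range(N)) + b['o'])
    u = np.tanh(W['u'] @ x_j + sum(U['u'][l] @ h_children[l] for l in range(N)) + b['u'])
    # Each child k gets its own forget gate, conditioned on *all* children (Eq. 10).
    f = [sigmoid(W['f'] @ x_j + sum(U['f'][k][l] @ h_children[l] for l in range(N)) + b['f'])
         for k in range(N)]
    c = i * u + sum(f[k] * c_children[k] for k in range(N))   # Eq. 13
    h = o * np.tanh(c)                                         # Eq. 14
    return h, c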
In the remainder of this paper, we focus on the special cases of Dependency Tree-LSTMs and Constituency Tree-LSTMs. These architectures are in fact closely related; since we consider only binarized constituency trees, the parameterizations of the two models are very similar. The key difference is in the application of the compositional parameters: dependent vs. head for Dependency Tree-LSTMs, and left child vs. right child for Constituency Tree-LSTMs.

4 Models

We now describe two specific models that apply the Tree-LSTM architectures described in the previous section.

4.1 Tree-LSTM Classification

In this setting, we wish to predict labels ŷ from a discrete set of classes Y for some subset of nodes in a tree. For example, the label for a node in a
parse tree could correspond to some property of the phrase spanned by that node.

At each node j, we use a softmax classifier to predict the label ŷ_j given the inputs {x}_j observed at nodes in the subtree rooted at j. The classifier takes the hidden state h_j at the node as input:

    p̂_θ(y | {x}_j) = softmax(W^(s) h_j + b^(s)),
    ŷ_j = arg max_y p̂_θ(y | {x}_j).

The cost function is the negative log-likelihood of the true class labels y^(k) at each labeled node:

    J(θ) = −(1/m) Σ_{k=1}^{m} log p̂_θ(y^(k) | {x}^(k)) + (λ/2) ‖θ‖_2^2,

where m is the number of labeled nodes in the training set, the superscript k indicates the kth labeled node, and λ is an L2 regularization hyperparameter.
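A minimal sketch of this classification objective follows (NumPy, hypothetical helper names); the squared L2 norm of the full parameter set is assumed to be computed elsewhere, and class indices are assumed to be 0-based.

import numpy as np

def softmax(z):
    z = z - z.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def node_label_loss(h_nodes, y_nodes, Ws, bs, theta_l2, lam):
    # h_nodes: hidden states h_j of the m labeled nodes; y_nodes: gold class
    # indices; Ws, bs: softmax parameters W^(s), b^(s); theta_l2: ||theta||_2^2.
    m = len(h_nodes)
    nll = 0.0
    for h_j, y_j in zip(h_nodes, y_nodes):
        p_hat = softmax(Ws @ h_j + bs)
        nll -= np.log(p_hat[y_j])          # negative log-likelihood of gold label
    return nll / m + 0.5 * lam * theta_l2  # average NLL plus L2 penalty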
4.2 Semantic Relatedness of Sentence Pairs

…comparison of the signs of the input representations.

We want the expected rating under the predicted distribution p̂_θ given model parameters θ to be close to the gold rating y ∈ [1, K]: ŷ = r^T p̂_θ ≈ y. We therefore define a sparse target distribution p that satisfies y = r^T p:

    p_i = { y − ⌊y⌋,         if i = ⌊y⌋ + 1
          { ⌊y⌋ − y + 1,     if i = ⌊y⌋
          { 0,               otherwise

for 1 ≤ i ≤ K. The cost function is the regularized KL-divergence between p and p̂_θ:

    J(θ) = (1/m) Σ_{k=1}^{m} KL(p^(k) ‖ p̂_θ^(k)) + (λ/2) ‖θ‖_2^2,

where m is the number of training pairs and the superscript k indicates the kth sentence pair.
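A sketch of the target distribution and the regularized KL objective, as reconstructed from the equations above, is given below; the model producing the predicted distributions p̂_θ is not shown here.

import numpy as np

def sparse_target(y, K):
    # Target distribution p with r^T p = y for a gold rating y in [1, K]:
    # mass is split between the two integer ratings adjacent to y.
    p = np.zeros(K)
    floor_y = int(np.floor(y))
    if floor_y == K:                        # y is exactly the maximum rating
        p[K - 1] = 1.0
    else:
        p[floor_y - 1] = floor_y - y + 1    # 1-based index i = floor(y)
        p[floor_y] = y - floor_y            # 1-based index i = floor(y) + 1
    return p

def kl_relatedness_loss(p_hat_list, y_list, K, theta_l2, lam):
    # Regularized KL-divergence cost averaged over m sentence pairs;
    # theta_l2 is the squared L2 norm of the parameters (assumed precomputed).
    m = len(y_list)
    kl = 0.0
    for p_hat, y in zip(p_hat_list, y_list):
        p = sparse_target(y, K)
        mask = p > 0                        # KL summed only over the support of p
        kl += np.sum(p[mask] * np.log(p[mask] / p_hat[mask]))
    return kl / m + 0.5 * lam * theta_l2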
    LSTM Variant             Relatedness            Sentiment
                             d       |θ|            d       |θ|
    Standard                 150     203,400        168     315,840
    Bidirectional            150     203,400        168     315,840
    2-layer                  108     203,472        120     318,720
    Bidirectional 2-layer    108     203,472        120     318,720
    Constituency Tree        142     205,190        150     316,800
    Dependency Tree          150     203,400        168     315,840

Table 1: Memory dimensions d and composition function parameter counts |θ| for each LSTM variant that we evaluate.

neutral sentences are excluded). Standard bina-

    Method                                     Fine-grained   Binary
    RAE (Socher et al., 2013)                  43.2           82.4
    MV-RNN (Socher et al., 2013)               44.4           82.9
    RNTN (Socher et al., 2013)                 45.7           85.4
    DCNN (Blunsom et al., 2014)                48.5           86.8
    Paragraph-Vec (Le and Mikolov, 2014)       48.7           87.8
    CNN-non-static (Kim, 2014)                 48.0           87.2
    CNN-multichannel (Kim, 2014)               47.4           88.1
    DRNN (Irsoy and Cardie, 2014)              49.8           86.6
    LSTM                                       46.4 (1.1)     84.9 (0.6)
    Bidirectional LSTM                         49.1 (1.0)     87.5 (0.5)
    2-layer LSTM                               46.0 (1.3)     86.3 (0.6)
    2-layer Bidirectional LSTM                 48.5 (1.0)     87.2 (1.0)
    Dependency Tree-LSTM                       48.4 (0.4)     85.7 (0.4)
    Constituency Tree-LSTM
      – randomly initialized vectors           43.9 (0.6)     82.0 (0.5)
      – Glove vectors, fixed                   49.7 (0.4)     87.5 (0.8)
      – Glove vectors, tuned                   51.0 (0.5)     88.0 (0.3)

Table 2: Test set accuracies on the sentiment classification task (fine-grained and binary); standard deviations over runs in parentheses where reported.
Method Pearson’s r Spearman’s ρ MSE
Illinois-LH (Lai and Hockenmaier, 2014) 0.7993 0.7538 0.3692
UNAL-NLP (Jimenez et al., 2014) 0.8070 0.7489 0.3550
Meaning Factory (Bjerva et al., 2014) 0.8268 0.7721 0.3224
ECNU (Zhao et al., 2014) 0.8414 – –
Mean vectors 0.7577 (0.0013) 0.6738 (0.0027) 0.4557 (0.0090)
DT-RNN (Socher et al., 2014) 0.7923 (0.0070) 0.7319 (0.0071) 0.3822 (0.0137)
SDT-RNN (Socher et al., 2014) 0.7900 (0.0042) 0.7304 (0.0076) 0.3848 (0.0074)
LSTM 0.8528 (0.0031) 0.7911 (0.0059) 0.2831 (0.0092)
Bidirectional LSTM 0.8567 (0.0028) 0.7966 (0.0053) 0.2736 (0.0063)
2-layer LSTM 0.8515 (0.0066) 0.7896 (0.0088) 0.2838 (0.0150)
2-layer Bidirectional LSTM 0.8558 (0.0014) 0.7965 (0.0018) 0.2762 (0.0020)
Constituency Tree-LSTM 0.8582 (0.0038) 0.7966 (0.0053) 0.2734 (0.0108)
Dependency Tree-LSTM 0.8676 (0.0030) 0.8083 (0.0042) 0.2532 (0.0052)
Table 3: Test set results on the SICK semantic relatedness subtask. For our experiments, we report mean
scores over 5 runs (standard deviations in parentheses). Results are grouped as follows: (1) SemEval
2014 submissions; (2) Our own baselines; (3) Sequential LSTMs; (4) Tree-structured LSTMs.
Figure 3: Fine-grained sentiment classification accuracy vs. sentence length. For each ℓ, we plot accuracy for the test set sentences with length in the window [ℓ − 2, ℓ + 2]. Examples in the tail of the length distribution are batched in the final window (ℓ = 45). Curves shown: DT-LSTM, CT-LSTM, LSTM, Bi-LSTM.

Figure 4: Pearson correlations r between predicted similarities and gold ratings vs. sentence length. For each ℓ, we plot r for the pairs with mean length in the window [ℓ − 2, ℓ + 2]. Examples in the tail of the length distribution are batched in the final window (ℓ = 18.5). Curves shown: DT-LSTM, CT-LSTM, LSTM, Bi-LSTM.
systems without any additional feature engineering, with the best results achieved by the Dependency Tree-LSTM. Recall that in this task, both Tree-LSTM models only receive supervision at the root of the tree, in contrast to the sentiment classification task where supervision was also provided at the intermediate nodes. We conjecture that in this setting, the Dependency Tree-LSTM benefits from its more compact structure relative to the Constituency Tree-LSTM, in the sense that paths from input word vectors to the root of the tree are shorter on aggregate for the Dependency Tree-LSTM.

7 Discussion and Qualitative Analysis

7.1 Modeling Semantic Relatedness

In Table 4, we list nearest-neighbor sentences retrieved from a 1000-sentence sample of the SICK test set. We compare the neighbors ranked by the Dependency Tree-LSTM model against a baseline ranking by cosine similarity of the mean word vectors for each sentence.

The Dependency Tree-LSTM model exhibits several desirable properties. Note that in the dependency parse of the second query sentence, the word "ocean" is the second-furthest word from the root ("waving"), with a depth of 4. Regardless, the retrieved sentences are all semantically related to the word "ocean", which indicates that the Tree-LSTM is able to both preserve and emphasize information from relatively distant nodes. Additionally, the Tree-LSTM model shows greater robustness to differences in sentence length. Given the query "two men are playing guitar", the Tree-LSTM associates the phrase "playing guitar" with the longer, related phrase "dancing and singing in front of a crowd" (note as well that there is zero token overlap between the two phrases).

7.2 Effect of Sentence Length

One hypothesis to explain the empirical strength of Tree-LSTMs is that tree structures help mitigate the problem of preserving state over long sequences of words. If this were true, we would expect to see the greatest improvement over sequential LSTMs on longer sentences. In Figs. 3 and 4, we show the relationship between sentence length and performance as measured by the relevant task-specific metric. Each data point is a mean score over 5 runs, and error bars have been omitted for clarity.

We observe that while the Dependency Tree-LSTM does significantly outperform its sequential counterparts on the relatedness task for longer sentences of length 13 to 15 (Fig. 4), it also achieves consistently strong performance on shorter sentences. This suggests that unlike sequential LSTMs, Tree-LSTMs are able to encode semantically-useful structural information in the sentence representations that they compose.
Query: a woman is slicing potatoes
  Ranking by mean word vector cosine similarity:
    a woman is cutting potatoes (0.96)
    a woman is slicing herbs (0.92)
    a woman is slicing tofu (0.92)
  Ranking by Dependency Tree-LSTM model:
    a woman is cutting potatoes (4.82)
    potatoes are being sliced by a woman (4.70)
    tofu is being sliced by a woman (4.39)

Query: a boy is waving at some young runners from the ocean
  Ranking by mean word vector cosine similarity:
    a man and a boy are standing at the bottom of some stairs, which are outdoors (0.92)
    a group of children in uniforms is standing at a gate and one is kissing the mother (0.90)
    a group of children in uniforms is standing at a gate and there is no one kissing the mother (0.90)
  Ranking by Dependency Tree-LSTM model:
    a group of men is playing with a ball on the beach (3.79)
    a young boy wearing a red swimsuit is jumping out of a blue kiddies pool (3.37)
    the man is tossing a kid into the swimming pool that is near the ocean (3.19)

Query: two men are playing guitar
  Ranking by mean word vector cosine similarity:
    some men are playing rugby (0.88)
    two men are talking (0.87)
    two dogs are playing with each other (0.87)
  Ranking by Dependency Tree-LSTM model:
    the man is singing and playing the guitar (4.08)
    the man is opening the guitar for donations and plays with the case (4.01)
    two men are dancing and singing in front of a crowd (4.00)

Table 4: Most similar sentences from a 1000-sentence sample drawn from the SICK test set. The Tree-LSTM model is able to pick up on more subtle relationships, such as that between "beach" and "ocean" in the second example.
8 Related Work

Distributed representations of words (Rumelhart et al., 1988; Collobert et al., 2011; Turian et al., 2010; Huang et al., 2012; Mikolov et al., 2013; Pennington et al., 2014) have found wide applicability in a variety of NLP tasks. Following this success, there has been substantial interest in the area of learning distributed phrase and sentence representations (Mitchell and Lapata, 2010; Yessenalina and Cardie, 2011; Grefenstette et al., 2013; Mikolov et al., 2013), as well as distributed representations of longer bodies of text such as paragraphs and documents (Srivastava et al., 2013; Le and Mikolov, 2014).

Our approach builds on recursive neural networks (Goller and Kuchler, 1996; Socher et al., 2011), which we abbreviate as Tree-RNNs in order to avoid confusion with recurrent neural networks. Under the Tree-RNN framework, the vector representation associated with each node of a tree is composed as a function of the vectors corresponding to the children of the node. The choice of composition function gives rise to numerous variants of this basic framework. Tree-RNNs have been used to parse images of natural scenes (Socher et al., 2011), compose phrase representations from word vectors (Socher et al., 2012), and classify the sentiment polarity of sentences (Socher et al., 2013).

9 Conclusion

In this paper, we introduced a generalization of LSTMs to tree-structured network topologies. The Tree-LSTM architecture can be applied to trees with arbitrary branching factor. We demonstrated the effectiveness of the Tree-LSTM by applying the architecture in two tasks: semantic relatedness and sentiment classification, outperforming existing systems on both. Controlling for model dimensionality, we demonstrated that Tree-LSTM models are able to outperform their sequential counterparts. Our results suggest further lines of work in characterizing the role of structure in producing distributed representations of sentences.

Acknowledgements

We thank our anonymous reviewers for their valuable feedback. Stanford University gratefully acknowledges the support of a Natural Language Understanding-focused gift from Google Inc. and the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).

Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2).
Bjerva, Johannes, Johan Bos, Rob van der Goot, and Malvina Nissim. 2014. The Meaning Factory: Formal semantics for recognizing textual entailment and determining semantic similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Blunsom, Phil, Edward Grefenstette, Nal Kalchbrenner, et al. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Chen, Danqi and Christopher D Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12:2121–2159.

Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science 14(2):179–211.

Foltz, Peter W, Walter Kintsch, and Thomas K Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes 25(2-3):285–307.

Ganitkevitch, Juri, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In Proceedings of HLT-NAACL 2013.

Goller, Christoph and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks.

Graves, Alex, Navdeep Jaitly, and A-R Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

Grefenstette, Edward, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics.

Hochreiter, Sepp. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(02):107–116.

Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8).

Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL).

Irsoy, Ozan and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems.

Jimenez, Sergio, George Duenas, Julia Baquero, Alexander Gelbukh, Av Juan Dios Bátiz, and Av Mendizábal. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Kim, Yoon. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Klein, Dan and Christopher D Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics.

Lai, Alice and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Landauer, Thomas K and Susan T Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2):211.

Le, Quoc and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14).
Marelli, Marco, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).

Mikolov, Tomáš. 2012. Statistical Language Models Based on Neural Networks. Ph.D. thesis, Brno University of Technology.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.

Mitchell, Jeff and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science 34(8):1388–1429.

Pennington, Jeffrey, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5.

Socher, Richard, Brody Huval, Christopher D Manning, and Andrew Y Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP).

Socher, Richard, Andrej Karpathy, Quoc V Le, Christopher D Manning, and Andrew Y Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2.

Socher, Richard, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).

Socher, Richard, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15:1929–1958.

Srivastava, Nitish, Ruslan Salakhutdinov, and Geoffrey Hinton. 2013. Modeling documents with a Deep Boltzmann Machine. In Uncertainty in Artificial Intelligence.

Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.

Turian, Joseph, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.

Yessenalina, Ainur and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Zaremba, Wojciech and Ilya Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.

Zhao, Jiang, Tian Tian Zhu, and Man Lan. 2014. ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014).