Hierarchical Attention Networks for Document Classification
Zichao Yang1, Diyi Yang1, Chris Dyer1, Xiaodong He2, Alex Smola1, Eduard Hovy1
1 Carnegie Mellon University, 2 Microsoft Research, Redmond
{zichaoy, diyiy, cdyer, hovy}@cs.cmu.edu
xiaohe@microsoft.com alex@smola.org
LSTM takes the whole document as a single sequence, and the average of the hidden states of all words is used as the feature for classification. Conv-GRNN and LSTM-GRNN were proposed by Tang et al. (2015). They also exploit the hierarchical structure: a CNN or LSTM provides a sentence vector, and then a gated recurrent neural network (GRNN) combines the sentence vectors into a document-level vector representation for classification.

3.3 Model configuration and training

We split documents into sentences and tokenize each sentence using Stanford's CoreNLP (Manning et al., 2014). We only retain words appearing more than 5 times when building the vocabulary and replace words that appear 5 times or fewer with a special UNK token. We obtain word embeddings by training an unsupervised word2vec (Mikolov et al., 2013) model on the training and validation splits and then use these embeddings to initialize W_e.
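The preprocessing above can be made concrete with a short sketch. The snippet below is our illustration rather than the authors' released code: the corpus layout (documents as lists of tokenized sentences), the helper names, and the dict-like w2v lookup holding the pretrained word2vec vectors are assumptions; only the cutoff (more than 5 occurrences), the UNK replacement, and the 200-dimensional embeddings come from the text.

```python
from collections import Counter
import numpy as np

UNK, EMB_DIM = "<UNK>", 200  # embedding dimension from Section 3.3

def build_vocab(tokenized_docs, min_count=6):
    """Keep words appearing more than 5 times; everything else maps to UNK."""
    counts = Counter(w for doc in tokenized_docs for sent in doc for w in sent)
    vocab = {UNK: 0}
    for word, c in counts.items():
        if c >= min_count:            # i.e. the word appears more than 5 times
            vocab[word] = len(vocab)
    return vocab

def init_embeddings(vocab, w2v):
    """Initialize W_e from pretrained word2vec vectors; unseen rows stay random."""
    W_e = np.random.uniform(-0.1, 0.1, (len(vocab), EMB_DIM))
    for word, idx in vocab.items():
        if word in w2v:               # w2v: dict-like, word -> 200-d vector
            W_e[idx] = w2v[word]
    return W_e
```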
The hyperparameters of the models are tuned on the validation set. In our experiments, we set the word embedding dimension to 200 and the GRU dimension to 50. In this case the combination of the forward and backward GRUs gives 100 dimensions for the word/sentence annotations. The word/sentence context vectors also have a dimension of 100 and are initialized at random.

For training, we use a mini-batch size of 64, and documents of similar length (in terms of the number of sentences in the document) are organized into a batch. We find that this length-based grouping accelerates training by a factor of three. We use stochastic gradient descent with a momentum of 0.9 to train all models and pick the best learning rate by grid search on the validation set.
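For concreteness, the following PyTorch-style sketch shows how the stated dimensions and training choices could be wired together. It is a hedged illustration, not the paper's implementation: the class and variable names (WordEncoder, docs, the placeholder learning rate) are ours, and the attention layers, the sentence-level encoder, and the softmax classifier are omitted.

```python
import torch
import torch.nn as nn
from torch.optim import SGD

class WordEncoder(nn.Module):
    """Bidirectional word-level GRU: 200-d embeddings -> 100-d word annotations."""
    def __init__(self, vocab_size, emb_dim=200, gru_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, gru_dim, bidirectional=True, batch_first=True)

    def forward(self, word_ids):               # word_ids: (batch, words)
        h, _ = self.gru(self.embedding(word_ids))
        return h                               # (batch, words, 2 * 50 = 100)

def length_sorted_batches(docs, batch_size=64):
    """Group documents with a similar number of sentences into one mini-batch."""
    docs = sorted(docs, key=lambda d: len(d))  # each d is a list of sentences
    return [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]

# SGD with momentum 0.9; the learning rate is chosen by grid search on the
# validation set (the value 0.05 below is only a placeholder).
model = WordEncoder(vocab_size=50000)
optimizer = SGD(model.parameters(), lr=0.05, momentum=0.9)
```

A sentence-level bidirectional GRU with the same hidden size would be stacked on top of the sentence vectors, giving the 100-dimensional sentence annotations mentioned above.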
3.4 Results and analysis

The experimental results on all data sets are shown in Table 2. We refer to our models as HN-{AVE, MAX, ATT}. Here HN stands for Hierarchical Network, AVE indicates averaging, MAX indicates max-pooling, and ATT indicates our proposed hierarchical attention model. Results show that HN-ATT gives the best performance across all data sets.

Table 2: Document classification accuracy, in percentage, of all methods on Yelp 2013, Yelp 2014, Yelp 2015, IMDB, Yahoo Answers and Amazon.

The improvement holds regardless of data set size. For smaller data sets such as Yelp 2013 and IMDB, our model outperforms the previous best baseline methods by 3.1% and 4.1% respectively. This finding is consistent across the other, larger data sets: our model outperforms the previous best models by 3.2%, 3.4%, 4.6% and 6.0% on Yelp 2014, Yelp 2015, Yahoo Answers and Amazon Reviews. The improvement also holds regardless of the type of task: sentiment classification, which includes Yelp 2013-2015, IMDB and Amazon Reviews, and topic classification for Yahoo Answers.

From Table 2 we can see that neural-network-based methods that do not exploit the hierarchical document structure, such as LSTM, CNN-word and CNN-char, have little advantage over traditional methods for large-scale (in terms of document size) text classification. For example, SVM+TextFeatures achieves 59.8, 61.8, 62.4 and 40.5 on Yelp 2013, 2014, 2015 and IMDB respectively, while CNN-word achieves 59.7, 61.0, 61.5 and 37.6 respectively.

Exploiting the hierarchical structure alone, as in HN-AVE and HN-MAX, already yields a significant improvement over LSTM, CNN-word and CNN-char. For example, our HN-AVE outperforms CNN-word by 7.3%, 8.8%, 8.5% and 10.2% on Yelp 2013, 2014, 2015 and IMDB respectively. Our model HN-ATT, which further combines the attention mechanism with the hierarchical structure, improves over the previous best models (LSTM-GRNN) by 3.1%, 3.4%, 3.5% and 4.1% respectively. More interestingly, in the experiments, HN-AVE is equivalent to using non-informative global word/sentence context vectors (e.g., if they are all-zero vectors, then the attention weights in Eq. 6 and 9 become uniform weights). Compared to HN-AVE, the HN-ATT model gives superior performance across the board. This clearly demonstrates the effectiveness of the proposed global word and sentence importance vectors for the HAN.

3.5 Context dependent attention weights

If words were inherently important or unimportant, models without an attention mechanism might work well, since the model could automatically assign low weights to irrelevant words and vice versa. However, the importance of words is highly context dependent. For example, the word good may appear in a review that has the lowest rating, either because users are only happy with part of the product/service or because they use it in a negation, such as not good. To verify that our model can capture context-dependent word importance, we plot the distributions of the attention weights of the words good and bad from the test split of the Yelp 2013 data set, as shown in Figure 3(a) and Figure 4(a). We can see that a word can receive attention weights anywhere from 0 to 1, indicating that our model captures diverse contexts and assigns context-dependent weights to words.

For further illustration, we plot the distribution conditioned on the rating of the review. Subfigures (b)-(f) in Figure 3 and Figure 4 correspond to ratings 1-5 respectively. In particular, Figure 3(b) shows that the weight of good concentrates on the low end in reviews with rating 1. As the rating increases, the weight distribution shifts toward the higher end. This means that the word good plays a more important role in reviews with higher ratings. We can observe the converse trend in Figure 4 for the word bad. This confirms that our model can capture context-dependent word importance.
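Figures 3 and 4 can be reproduced from the word-level attention weights with a few lines of plotting code. The sketch below is our own illustration: it assumes the test documents have already been run through the model and stored as (rating, sentences, word_alphas) triples; that data layout and the function names are assumptions, and only the histograms over [0, 1] stratified by rating follow the description above.

```python
import matplotlib.pyplot as plt

def collect_weights(test_docs, target):
    """Gather the attention weight of every occurrence of `target`, keyed by rating.

    Each item in `test_docs` is assumed to be a (rating, sentences, word_alphas)
    triple, where word_alphas[i][j] is the attention weight of word j in sentence i.
    """
    by_rating = {r: [] for r in range(1, 6)}
    for rating, sentences, word_alphas in test_docs:
        for sent, alphas in zip(sentences, word_alphas):
            for word, a in zip(sent, alphas):
                if word == target:
                    by_rating[rating].append(float(a))
    return by_rating

def plot_weight_distribution(by_rating, target):
    """One aggregate histogram plus one histogram per rating, as in Figures 3 and 4."""
    fig, axes = plt.subplots(2, 3, figsize=(12, 6))
    all_weights = [a for ws in by_rating.values() for a in ws]
    axes[0, 0].hist(all_weights, bins=20, range=(0, 1), density=True)
    axes[0, 0].set_title(f"{target}: aggregate")
    for rating, ax in zip(range(1, 6), axes.flat[1:]):
        ax.hist(by_rating[rating], bins=20, range=(0, 1), density=True)
        ax.set_title(f"{target}: rating {rating}")
    fig.tight_layout()
    return fig
```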
Figure 3: Attention weight distribution of good. (a) aggregate distribution on the test split; (b)-(f) stratified for reviews with ratings 1-5 respectively. We can see that the weight distribution shifts to the higher end as the rating goes higher.

Figure 4: Attention weight distribution of the word bad. The setup is as above: (a) contains the aggregate distribution, while (b)-(f) contain stratifications to reviews with ratings 1-5 respectively. Contrary to before, the word bad is considered important for poor ratings and less so for good ones.
3.6 Visualization of attention

In order to validate that our model is able to select informative sentences and words in a document, we visualize the hierarchical attention layers in Figures 5 and 6 for several documents from the Yelp 2013 and Yahoo Answers data sets. Every line is a sentence (sometimes sentences spill over several lines due to their length). Red denotes the sentence weight and blue denotes the word weight. Due to the hierarchical structure, we normalize the word weight by the sentence weight to make sure that only important words in important sentences are emphasized. For visualization purposes we display √(p_s) p_w, where p_s is the sentence weight and p_w is the word weight. The √(p_s) term keeps the important words in unimportant sentences from becoming totally invisible.

Figure 5 shows that our model can select the words carrying strong sentiment, such as delicious, amazing and terrible, and their corresponding sentences. Sentences containing many words like cocktails, pasta and entree are disregarded. Note that our model can not only select words carrying strong sentiment, it can also deal with complex across-sentence context. For example, the first document of Figure 5 contains the sentence i don't even like scallops; looking purely at this single sentence, we might think it is a negative comment, but our model looks at the context of this sentence, figures out that this is a positive review, and chooses to ignore this sentence.

Our hierarchical attention mechanism also works well for topic classification on the Yahoo Answers data set. For example, for the left document in Figure 6, whose label 1 denotes Science and Mathematics, our model accurately localizes the words zebra, stripes, camouflage, predator and their corresponding sentences. For the right document, whose label 4 denotes Computers and Internet, our model focuses on web, searches, browsers and their corresponding sentences. Note that this happens in a multiclass setting, that is, detection happens before the selection of the topic!
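The display weighting described above is a one-line computation, and the red/blue rendering can be approximated with simple HTML. The sketch below is our reconstruction under the assumption that the sentence weights p_s and word weights p_w are already available and lie in [0, 1]; the rgba-based coloring scheme is an illustrative stand-in for whatever rendering the authors used.

```python
import math
import html

def display_weights(p_s, p_w):
    """Combined weight sqrt(p_s) * p_w used for display, per sentence and word."""
    return [[math.sqrt(ps) * pw for pw in sent_pw] for ps, sent_pw in zip(p_s, p_w)]

def render_document(sentences, p_s, p_w):
    """Render one document as HTML: red intensity = sentence weight, blue = word weight."""
    rows = []
    weights = display_weights(p_s, p_w)
    for sent, ps, word_ws in zip(sentences, p_s, weights):
        words = "".join(
            f'<span style="background: rgba(0, 0, 255, {w:.2f})">{html.escape(tok)} </span>'
            for tok, w in zip(sent, word_ws)
        )
        rows.append(
            f'<div><span style="background: rgba(255, 0, 0, {ps:.2f})">&nbsp;&nbsp;</span> {words}</div>'
        )
    return "\n".join(rows)
```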
Figure 5: Documents from Yelp 2013. Label 4 means star 5, label 0 means star 1.
Figure 6: Documents from Yahoo Answers. Label 1 denotes Science and Mathematics and label 4 denotes Computers and Internet.

4 Related Work

Kim (2014) uses neural networks for text classification. The architecture is a direct application of CNNs, as used in computer vision (LeCun et al., 1998), albeit with NLP interpretations. Johnson and Zhang (2014) explore the case of directly using a high-dimensional one-hot vector as input and find that it performs well. Unlike these word-level models, Zhang et al. (2015) apply a character-level CNN for text classification and achieve competitive results. Socher et al. (2013) use recursive neural networks for text classification.
Tai et al. (2015) explore the structure of a sentence and use a tree-structured LSTM for classification. There are also works that combine LSTM and CNN structures for sentence classification (Lai et al., 2015; Zhou et al., 2015). Tang et al. (2015) use hierarchical structure in sentiment classification: they first use a CNN or LSTM to obtain a sentence vector and then a bidirectional gated recurrent neural network to compose the sentence vectors into a document vector. Other works use hierarchical structure in sequence generation (Li et al., 2015) and language modeling (Lin et al., 2015).

The attention mechanism was proposed by Bahdanau et al. (2014) for machine translation: within an encoder-decoder framework, an attention mechanism selects the reference words in the source language for each word to be generated in the target language. Xu et al. (2015) use the attention mechanism in image caption generation to select the relevant image regions when generating words for the caption. Further uses of the attention mechanism include parsing (Vinyals et al., 2014), natural language question answering (Sukhbaatar et al., 2015; Kumar et al., 2015; Hermann et al., 2015), and image question answering (Yang et al., 2015). Unlike these works, we explore a hierarchical attention mechanism; to the best of our knowledge, this is the first such instance.

5 Conclusion

In this paper, we proposed hierarchical attention networks (HAN) for classifying documents. As a convenient side effect, we obtain better visualization through the highly informative components of a document. Our model progressively builds a document vector by aggregating important words into sentence vectors and then aggregating important sentence vectors into document vectors. Experimental results demonstrate that our model performs significantly better than previous methods. Visualization of the attention layers illustrates that our model is effective in picking out important words and sentences.

Acknowledgments This work was supported by Microsoft Research.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Phil Blunsom, Edward Grefenstette, Nal Kalchbrenner, et al. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J Smola, Jing Jiang, and Chong Wang. 2014. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 193–202. ACM.

Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, Li Deng, and Yelong Shen. 2014. Modeling interestingness with deep neural networks. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. arXiv preprint arXiv:1506.03340.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. Springer.

Rie Johnson and Tong Zhang. 2014. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Svetlana Kiritchenko, Xiaodan Zhu, and Saif M Mohammad. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, pages 723–762.

Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2015. Ask me anything: Dynamic memory networks for natural language processing. arXiv preprint arXiv:1506.07285.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.

Rui Lin, Shujie Liu, Muyun Yang, Mu Li, Ming Zhou, and Sheng Li. 2015. Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 899–907.

Wang Ling, Tiago Luís, Luís Marujo, Ramon Fernandez Astudillo, Silvio Amir, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Finding function in form: Compositional character models for open vocabulary word representation. arXiv preprint arXiv:1508.02096.

Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 142–150. Association for Computational Linguistics.

Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135.

Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. 1998. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, volume 62, pages 98–105.

Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 101–110. ACM.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. EMNLP.
Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. arXiv preprint arXiv:1503.08895.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proc. ACL.

Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. 2014. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1555–1565.

Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1422–1432.

Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. Grammar as a foreign language. arXiv preprint arXiv:1412.7449.

Sida Wang and Christopher D Manning. 2012. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 90–94. Association for Computational Linguistics.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2015. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. arXiv preprint arXiv:1509.01626.

Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630.