Sie sind auf Seite 1von 7

See discussions, stats, and author profiles for this publication at: https://www.researchgate.


Refining Word Embeddings for Sentiment Analysis

Conference Paper · January 2017

DOI: 10.18653/v1/D17-1056


41 259

4 authors:

Liang-Chih Yu Jin Wang

Yuan Ze University Yunnan University


K. Robert Lai Xuejie Zhang

Yuan Ze University Yunnan University


Some of the authors of this publication are also working on these related projects:

Chinese Grammatical Error Diagnosis View project

Cross-Lingual Dimensional Sentiment Analysis View project

All content following this page was uploaded by K. Robert Lai on 04 March 2018.

The user has requested enhancement of the downloaded file.

Refining Word Embeddings for Sentiment Analysis

Liang-Chih Yu1,3, Jin Wang2,3,4, K. Robert Lai2,3 and Xuejie Zhang4

Department of Information Management, Yuan Ze University, Taiwan
Department of Computer Science & Engineering, Yuan Ze University, Taiwan
Innovation Center for Big Data and Digital Convergence Yuan Ze University, Taiwan
School of Information Science and Engineering, Yunnan University, Yunnan, P.R. China

Abstract 1995) are also useful information for learning

word embeddings. These embeddings have been
Word embeddings that can capture seman- successfully used for various natural language
tic and syntactic information from contexts processing tasks.
have been extensively used for various
In general, existing word embeddings are se-
natural language processing tasks. Howev-
er, existing methods for learning context- mantically oriented. They can capture semantic
based word embeddings typically fail to and syntactic information from unlabeled data in
capture sufficient sentiment information. an unsupervised manner but fail to capture suffi-
This may result in words with similar vec- cient sentiment information. This makes it diffi-
tor representations having an opposite sen- cult to directly apply existing word embeddings to
timent polarity (e.g., good and bad), thus sentiment analysis. Prior studies have reported
degrading sentiment analysis performance. that words with similar vector representations
Therefore, this study proposes a word vec- (similar contexts) may have opposite sentiment
tor refinement model that can be applied to polarities, as in the example of happy-sad men-
any pre-trained word vectors (e.g.,
tioned in (Mohammad et al., 2013) and good-bad
Word2vec and GloVe). The refinement
model is based on adjusting the vector rep- in (Tang et al., 2016). Composing these word vec-
resentations of words such that they can be tors may produce sentence vectors with similar
closer to both semantically and sentimen- vector representations but opposite sentiment po-
tally similar words and further away from larities (e.g., a sentence containing happy and a
sentimentally dissimilar words. Experi- sentence containing sad may have similar vector
mental results show that the proposed representations). Building on such ambiguous
method can improve conventional word vectors will affect sentiment classification per-
embeddings and outperform previously formance.
proposed sentiment embeddings for both To enhance the performance of distinguishing
binary and fine-grained classification on
words with similar vector representations but op-
Stanford Sentiment Treebank (SST).
posite sentiment polarities, recent studies have
suggested learning sentiment embeddings from
1 Introduction
labeled data in a supervised manner (Maas et al.,
Word embeddings are a technique to learn con- 2011; Labutov and Lipson, 2013; Lan et al., 2016;
tinuous low-dimensional vector space representa- Ren et al., 2016; Tang et al., 2016). The common
tions of words by leveraging the contextual in- goal of these methods is to capture both seman-
formation from large corpora. Examples include tic/syntactic and sentiment information such that
C&W (Collobert and Weston, 2008; Collobert et sentimentally similar words have similar vector
al., 2011), Word2vec (Mikolov et al., 2013a; representations. They typically apply an objective
2013b) and GloVe (Pennington et al., 2014). In function to optimize word vectors based on the
addition to the contextual information, character- sentiment polarity labels (e.g., positive and nega-
level subwords (Bojanowski et al., 2016) and se- tive) given by the training instances. The use of
mantic knowledge resources (Faruqui et al., 2015; such sentiment embeddings has improved the per-
Kiela et al., 2015) such as WordNet (Miller, formance of binary sentiment classification.

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 534–539
Copenhagen, Denmark, September 7–11, 2017. 2017
c Association for Computational Linguistics
This study adopts another strategy to obtain Target word: good (7.89)
both semantic and sentiment word vectors. Instead Ranked by Ranked by
cosine similarity sentiment score
of building a new word embedding model from great (7.50) excellent (7.56)
labeled data, we propose a word vector refinement bad (3.24) great (7.50)
model to refine existing semantically oriented terrific (7.12) fantastic (8.36)
word vectors using sentiment lexicons. That is, the Re-ranking
decent (6.27) wonderful (7.41)
proposed model can be applied to the pre-trained nice (6.95) terrific (7.12)
vectors obtained by any word representation excellent (7.56) nice (6.95)
learning models (e.g., Word2vec and GloVe) as a fantastic (8.36) decent (6.27)
post-processing step to adapt the pre-trained vec- solid (5.65) solid (5.65)
tors to sentiment applications. The refinement lousy (3.14) bad (3.24)
model is based on adjusting the pre-trained vector wonderful (7.41) lousy (3.14)
of each affective word in a given sentiment lexi-
con such that it can be closer to a set of both se- Figure 1: Example of nearest neighbor rank-
mantically and sentimentally similar nearest ing.
neighbors (i.e., those with the same polarity) and
further away from sentimentally dissimilar neigh-
(target word) and the other words in the lexicon
bors (i.e., those with an opposite polarity).
based on the cosine distance of their pre-trained
The proposed refinement model is evaluated by
vectors, and then select top-k most similar words
examining whether our refined embeddings can
as the nearest neighbors. These semantically simi-
improve conventional word embeddings and out-
lar nearest neighbors are then re-ranked according
perform previously proposed sentiment embed-
to their sentiment scores provided by the lexicon
dings. To this end, several deep neural network
such that the sentimentally similar neighbors can
classifiers that performed well on the Stanford
be ranked higher and dissimilar neighbors lower.
Sentiment Treebank (SST) (Socher et al., 2013)
Finally, the pre-trained vector of the target word is
are selected, including convolutional neural net-
refined to be closer to its semantically and senti-
works (CNN) (Kim, 2014), deep averaging net-
mentally similar nearest neighbors and further
work (DAN) (Iyyer et al., 2015) and long-short
away from sentimentally dissimilar neighbors.
term memory (LSTM) (Tai et al., 2015; Looks et
The following subsections provide a detailed de-
al., 2017). The conventional word embeddings
scription of the nearest neighbor ranking and re-
used in these classifiers are then replaced by our
finement model.
refined versions and previously proposed senti-
ment embeddings to re-run the classification for 2.1 Nearest Neighbor Ranking
performance comparison. The SST is chosen be-
cause it can show the effect of using different The sentiment lexicon used in this study is the ex-
word embeddings on fine-grained sentiment clas- tended version of Affective Norms of English
sification, whereas prior studies only reported bi- Words (E-ANEW) (Warriner et al., 2013). It con-
nary classification results. tains 13,915 words and each word is associated
The rest of this paper is organized as follows. with a real-valued score in [1, 9] for the dimen-
Section 2 describes the proposed word vector re- sions of valence, arousal and dominance. The va-
finement model. Section 3 presents the evaluation lence represents the degree of positive and nega-
results. Conclusions are drawn in Section 4. tive sentiment, where values of 1, 5 and 9 respec-
tively denote most negative, neutral and most pos-
2 Word Vector Refinement itive sentiment. In Fig. 1, good has a valence score
of 7.89, which is greater than 5, and thus can be
The refinement procedure begins by giving a set considered positive. Conversely, bad has a va-
of pre-trained word vectors and a sentiment lexi- lence score of 3.24 and is thus negative. In addi-
con annotated with real-valued sentiment scores. tion to the E-ANEW, other lexicons such as Sen-
Our goal is to refine the pre-trained vectors of the tiWordNet (Esuli and Fabrizio, 2006), SoCal
affective words in the lexicon such that they can (Taboada et al., 2011), SentiStrength (Thelwall et
capture both semantic and sentiment information. al., 2012), Vader (Hutto et al., 2014), ANTUSD
To accomplish this goal, we first calculate the se- (Wang and Ku, 2016) and SCL-NMA
mantic similarity between each affective word (Kiritchenko and Mohammad, 2016) also provide

real-valued sentiment intensity or strength scores
like the valence scores.
For each target word to be refined, the top-k
semantically similar nearest neighbors are first se-
lected and ranked in descending order of their co-
sine similarities. In Fig. 1, the left ranked list
shows the top 10 nearest neighbors for the target
word good. The semantically ranked list is then
sentimentally re-ranked based on the absolute dif-
ference of the valence scores between the target
word and the words in the list. A smaller differ-
ence indicates that the word is more sentimentally
similar to the target word, and thus will be ranked
higher. As shown in the right ranked list in Fig. 1, Figure 2: Conceptual diagram of word vector
the re-ranking step can rank the sentimentally refinement.
similar neighbors higher and the dissimilar neigh-
bors lower. In the refinement model, the higher
ranked sentimentally similar neighbors will re- similar neighbors and further away from lower-
ceive a higher weight to refine the pre-trained vec- ranked dissimilar neighbors, as shown in Fig. 2.
tor of the target word. To prevent too many words being moved to
the same location and thereby producing too
2.2 Refinement Model many similar vectors, we add a constraint to keep
each pre-trained vector within a certain range
Once the word list ranked by both cosine similari-
from its original vector. The objective function is
ty and valence scores for each target word is ob-
thus divided as two parts:
tained, its pre-trained vector will be refined to be
(1) closer to its sentimentally similar neighbors, arg min Φ (V )=
(2) further away from its dissimilar neighbors, and n  k  (2)
(3) not too far away from the original vector. arg min ∑ α dist (vit +1 , vit ) + β ∑ wij dist (vit +1 , v tj ) 
Let V = {v1, v2, …, vn} be a set of the pre-
=i 1 =  j 1 
trained vectors corresponding to the affective where dist (vit +1 , vit ) denotes the distance between
words in the sentiment lexicon. For each target to the vector of the target word in step t and t+1, i.e.,
be refined, the refinement model iteratively mini- the distance between the refined vector and its
mizes the distance between the target word and its original vector. The later one represents the dis-
top-k nearest neighbors. The objective function tance between the vector of the target word and
Φ(V) can thus be defined as that of its neighbors (similar to Eq. (1)). The pa-
n k rameters α and β together are used as a ratio to
Φ (V ) = ∑∑
=i 1 =j 1
wij dist (vi , v j ) (1) control how far the refined vector can be moved
away from its original vector and toward its near-
where n denotes the total number of vectors in V est neighbors. A greater ratio indicates a stronger
to be refined, vi denotes the vector of a target constraint on keeping the refined vector closer to
word, vj denotes the vector of one of its nearest its original vector. For the extreme case of α=1
neighbors in the ranked list, dist(vi, vj) denotes the and β=0, the target word will not be moved (re-
distance between vi and vj, and wij denotes the fined). As the ratio decreases, the constraint de-
weight of the target word’s nearest neighbor, de- creases accordingly and the refined vector can be
fined as the reciprocal rank of a ranked list. For moved closer to its nearest neighbors. The setting
example, excellent in Fig. 1 will receive a weight of α=0 and β=1 means that the constraint is disa-
of 1, great will receive a weight of 1/2, and so on. bled.
A word ranked higher will receive a higher To facilitate the calculation of the partial de-
weight. This weight is used to control the move- rivative of Φ(V), dist(vi, vj) in the above equa-
ment direction of the target word towards to its tions is measured by the squared Euclidean dis-
nearest neighbors. That is, the target word will be tance, defined as
moved closer to the higher-ranked sentimentally

 Re(Glove) and Re(Word2vec): Both the pre-
dist (v=
i ,vj ) ∑ (v
d =1
i − v dj ) 2 (3) trained GloVe and Word2vec were refined
using E-ANEW (Warriner et al., 2013). Each
where D is the dimensionality of the word vectors. affective word was refined by its top k=10
The global optimal solution of Φ(V) can be found nearest neighbors with parameters of α:β=0.1
by using an iterative update method. To do so, we (1:10) (see Eq. (2)).
solve the partial derivation of Eq. (2) in step t with
 HyRank: It was trained using SST, NRC
respect to word vector vit , and by setting
Sentiment140 and IMDB datasets. We com-
∂Φ (V ) pared this method because its code is public-
= 0 to obtain a new vector vit +1 in step
∂vit ly accessible3.
t+1. The iterative update procedure is defined as Classifiers. The above word embeddings were
k used by CNN (Kim, 2014) 4, DAN (Iyyer et al.,
γ vit + β ∑ wij vtj 2015) 5, , bi-directional LSTM (Bi-LSTM) (Tai et
j =1
vit +1 = k
(4) al., 2015) 6 and Tree-LSTM (Looks et al., 2017) 7
γ + β ∑ wij with default parameter values.
j =1
Comparative Results. Table 1 presents the eval-
Through the iterative procedure, the vector uation results of using different word embeddings
representation of each target word will be for different classifiers. For the pre-trained word
iteratively updated until the change of the location embeddings, GloVe outperformed Word2vec for
of the target word’s vector is converged. The DAN, Bi-LSTM and Tree-LSTM, whereas
refinement process will be terminated when all Word2vec yielded better performance for CNN.
target words are refined. After the proposed refinement model was applied,
both the pre-trained Word2vec and GloVe were
3 Experimental Results improved. The Re(Word2vec) and Re(GloVe) re-
spectively improved Word2vec and GloVe by
This section evaluates the proposed refinement 1.7% and 1.5% averaged over all classifiers for
model, conventional word embeddings and previ- binary classification, and both 1.6% for fine-
ously proposed sentiment embeddings using sev- grained classification. In addition, both Re(GloVe)
eral deep neural network models for binary and and Re(Word2vec) outperformed the sentiment
fine-grained sentiment classification. embeddings HyRank for all classifiers on both bi-
Dataset. SST was adopted as the evaluation cor- nary and fine-grained classification, indicating
pus (Socher et al., 2013). The binary classification that the real-valued intensity scores used by the
subtask (positive and negative) contains proposed refinement model are more effective
6920/872/1821 samples for the train/dev/test sets, than the binary polarity labels used by the previ-
while the fine-grained ordinal classification sub- ously proposed sentiment embedings.
task (very negative, negative, neutral, positive, The proposed method yielded better perfor-
and very positive) contains 8544/1101/2210 sam- mance because it can remove semantically similar
ples of the train/dev/test sets. but sentimentally dissimilar nearest neighbors for
the target words by refining their vector represen-
Word Embeddings. The word embeddings used
tations. To demonstrate the effect, we define a
for comparison included two conventional word
measure noise@k to calculate the percentage of
embeddings (GloVe and Word2vec), our refined
versions (Re(GloVe) and Re(Word2vec)), and top k nearest neighbors with an opposite polarity
previously proposed sentiment embeddings (Hy- (i.e., noise) to each word in E-ANEW. For in-
Rank) (Tang et al., 2016). We used the same di- stance, in Fig. 1, the noise@10 for good is 20%
mensionality of 300 for all word embeddings. because there are two words with an opposite po-
larity to good among its top 10 nearest neighbors.
 GloVe and Word2vec: The respective GloVe Table 2 shows the average noise@10 for different
and Word2vec (skip-gram) were pre-trained
on Common Crawl 840B 1 and Google- 3
News 2. 4
1 6
2 7

Method Fine-grained Binary Word Embeddings Noise@10 (%)
DAN Word2vec 24.3
- Word2vec 46.2 84.5 GloVe 24.0
- GloVe 46.9 85.7 HyRank 18.5
- Re(Word2vec) 48.1 87.0 Re(Word2vec) 14.4
- Re(GloVe) 48.3 87.3 Re(GloVe) 13.8
- HyRank 47.2 86.6 Table 2: Average percentages of noisy words
CNN in the top 10 nearest neighbors for different
- Word2vec 48.0 87.2 word embeddings.
- GloVe 46.4 85.7
- Re(Word2vec) 48.8 87.9
Experiments on SST show that the proposed
- Re(GloVe) 47.7 87.5
method yielded better performance than both con-
- HyRank 47.3 87.6
ventional word embeddings and sentiment em-
Bi-LSTM beddings for both binary and fine-grained senti-
- Word2vec 48.8 86.3 ment classification. In addition, the performances
- GloVe 49.1 87.5 of various deep neural network models have also
- Re(Word2vec) 49.6 88.2 been improved. Future work will evaluate the
- Re(GloVe) 49.7 88.6 proposed method on another datasets. More ex-
- HyRank 49.0 87.3 periments will also be conducted to provide more
Tree-LSTM in-depth analysis.
- Word2vec 48.8 86.7
- GloVe 51.8 89.1
- Re(Word2vec) 50.1 88.3 This work was supported by the Ministry of Sci-
- Re(GloVe) 54.0 90.3 ence and Technology, Taiwan, ROC, under Grant
- HyRank 49.2 88.2 No. MOST 105-2221-E-155-059-MY2 and
MOST 105-2218-E-006-028. The authors would
Table 1: Accuracy of different classifiers with like to thank the anonymous reviewers and the ar-
different word embeddings for binary and fine- ea chairs for their constructive comments.
grained classification.
word embeddings. For the two semantic-oriented Piotr Bojanowski, Edouard Grave, Armand Joulin,
and Tomas Mikolov. 2016. Enriching word vectors
word vectors, GloVe and Word2vec, on average
with subword information. arXiv preprint arXiv:
around 24% of the top 10 nearest neighbors for 1607.04606.
each word are noisy words. After refinement, both Ronan Collobert and Jason Weston. 2008. A unified
Re(GloVe) and Re(Word2vec) can reduce architecture for natural language processing: Deep
noise@10 to around 14%. The HyRank also neural networks with multitask learning. In
yielded better performance than both GloVe and Proceedings of the 25th International Conference
Word2vec. on Machine Learning (ICML-08), pages 160–167.
Ronan Collobert, Jason Weston, L´eon Bottou,
4 Conclusion Michael Karlen, Koray Kavukcuoglu, and Pavel
Kuksa. 2011. Natural language processing (almost)
This study presents a word vector refinement from scratch. Journal of Machine Learning
model that requires no labeled corpus and can be Research, 12:2493–2537.
applied to any pre-trained word vectors. The pro- Andrea Esuli, and Fabrizio Sebastiani. 2006. Senti-
posed method selects a set of semantically similar WordNet: A publicly available lexical resource for
opinion mining. In Proceedings of the 5th Interna-
nearest neighbors and then ranks the sentimentally
tional Conference on Language Resources and
similar neighbors higher and dissimilar neighbors Evaluation (LREC-06), pages 417-422.
lower based on a sentiment lexicon. This ranked
Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris
list can guide the refinement procedure to itera- Dyer, Eduard H. Hovy and Noah A. Smith. 2015.
tively improve the word vector representations. Retrofitting word vectors to semantic lexicons. In

Proceedings of the 2015 Conference of the North Advances in Neural Information Processing
American Chapter of the Association for Computa- Systems (NIPS-13), pages 1-9.
tional Linguistics – Human Language Technolo- Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey
gies (NAACL/HLT-15), pages 1606-1615. Dean. 2013b. Efficient estimation of word
Clayton J. Hutto and Eric Gilbert. 2014. VADER: A representations in vector space. In Proceedings of
parsimonious rule-based model for sentiment anal- the International Conference on Learning
ysis of social media text. In Proceedings of 8th In- Representations (ICLR-2013), pages 1-12.
ternational AAAI Conference on Weblogs and So- George A. Miller. 1995. WordNet: A lexical database
cial Media, pages 216-225. for English. Communications of the ACM,
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, 38(11):39-41.
and Hal Daumé III. 2015. Deep unordered Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst,
composition rivals syntactic methods for text and Peter D. Turney. 2013. Computing lexical
classification. In Proceedings of the 53rd Annual contrast. Computational Linguistics, 39(3):555-590.
Meeting of the Association for Computational Jeffrey Pennington, Richard Socher, and Christopher
Linguistics (ACL-15), pages 1681-1691. D. Manning. 2014. GloVe: Global vectors for word
Douwe Kiela, Felix Hill and Stephen Clark. 2015. representation. In Proceedings of the Conference
Specializing word embeddings for similarity or on Empirical Methods in Natural Language
relatedness. In Proceedings of the 2015 Conference Processing (EMNLP-14), pages 1532-1543.
on Empirical Methods in Natural Language Yafeng Ren, Yue Zhang, Meishan Zhang, and
Processing (EMNLP-15), pages 2044-2048. Donghong Ji. 2016. Improving twitter sentiment
Yoon Kim. 2014. Convolutional neural networks for classification using topic-enriched multi-prototype
sentence classification. In Proceedings of the 2014 word embeddings. In Proceedings of the 30th
Conference on Empirical Methods in Natural AAAI Conference on Artificial Intelligence (AAAI-
Language Processing (EMNLP-14), pages 1746- 16), pages 3038-3044.
1751. Richard Socher, Alex Perelygin, and Jy Wu. 2013.
Svetlana Kiritchenko and Saif M. Mohammad. 2016. Recursive deep models for semantic
The effect of negators, modals, and degree adverbs compositionality over a sentiment treebank. In
on sentiment composition. In Proceedings of the Proceedings of the Conference on Empirical
7th Workshop on Computational Approaches to Methods in Natural Language Processing
Subjectivity, Sentiment and Social Media Analysis (EMNLP-13), pages 1631-1642.
at NAACL-HLT 2016, pages 43-52. Maite Taboada, Julian Brooke, and Kimberly Voll.
Igor Labutov and Hod Lipson. 2013. Re-embedding 2011. Lexicon-based methods for sentiment
words. In Proceedings of the 51st Annual Meeting analysis. Computational linguistics, 37(2):267–307.
of the Association for Computational Linguistics Kai Sheng Tai, Richard Socher, and Christopher D.
(ACL-13), pages 489-493. Manning. 2015. Improved semantic representations
Man Lan, Zhihua Zhang, Yue Lu, and Ju Wu. 2016. from tree-structured long short-term memory
Three convolutional neural network-based models networks. In Proceedings of the 53rd Annual
for learning sentiment word vectors towards Meeting of the Association for Computational
sentiment analysis. In Proceedings of the 2016 Linguistics (ACL-15), pages 1556–1566.
International Joint Conference on Neural Duyu Tang, Furu Wei, Bing Qin, Nan Yang, Ting Liu,
Networks (IJCNN-16), pages 3172-3179. and Ming Zhou. 2016. Sentiment embeddings with
Moshe Looks, Marcello Herreshoff, DeLesley applications to sentiment analysis. IEEE Trans.
Hutchins, and Peter Norvig. 2017. Deep learning Knowledge abd Data Engineering, 28(2):496-509.
with dynamic computation graphs. In Proceedings Mike Thelwall, Kevan Buckley, and Georgios Pal-
of the 5th International Conference on Learning toglou. 2012. Sentiment strength detection for the
Representations (ICLR-17). social web. Journal of the Association for Infor-
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, mation Science and Technology, 63(1):163-173.
Dan Huang, Andrew Y. Ng, and Christopher Potts. Shih-Ming Wang and Lun-Wei Ku. 2016. ANTUSD:
2011. Learning word vectors for sentiment analysis. A large Chinese sentiment dictionary. In Proc. of
In Proceedings of the 49th Annual Meeting of the the 10th International Conference on Language
Association for Computational Linguistics (ACL- Resources and Evaluation (LREC-16), pages 2697-
11), pages 142-150. 2702.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Amy Beth Warriner, Victor Kuperman, and Marc
Dean. 2013a. Distributed representations of words Brysbaert. 2013. Norms of valence, arousal, and
and phrases and their compositionality. In dominance for 13,915 English lemmas. Behavior
Proceedings of the Annual Conference on Research Methods, 45(4):1191-1207.


View publication stats