formity assumption, almost all words appear to associate with the ‘understand’ category, which is about four times bigger than the others; see table 1.) The gray horizontal line is at 0.20, the expected probability if there is no association between the word’s usage and the reaction categories.

The first panel in figure 1 depicts awesome. As one might expect, this correlates most strongly with the ‘rocks’ and ‘teehee’ categories; stories in which one uses this word are likely to be perceived as positive and light-hearted (especially as compared to the usual EP fare). Conversely, terrible, in the second plot, correlates with ‘hugs’ and ‘understand’; when an author describes something as terrible, readers react with sympathy and solidarity. The final panel depicts cocaine, one of a handful of words in the corpus that generate predominantly ‘wow, just wow’ reactions. We hope these examples help convey the nature of the reaction categories and also suggest that it is promising to try to use these data to learn sentiment-rich word vectors.

Finally, we address the question of how much the distributions matter as compared with a categorical view of sentiment. If the majority of the texts in the data received categorical or near-categorical responses, we might conclude that classification is

[Figure 2: The entropy of the reaction distributions. Histograms of the number of texts against entropy (0.0 to 2.0). (a) The full corpus. (b) Texts with ≥ 5 reactions.]

3 Model

We introduce a model that captures semantic associations among words as well as the blended distributional sentiment information conveyed by words. We assume each word is represented by a real-valued vector and use a probabilistic model to learn words’ vector representations from data. The learning procedure uses the unsupervised information of document-level word co-occurrences as well as the reaction distributions present in EP data.

Our work fits into the broad class of vector space models (VSMs), recently reviewed by Turney and Pantel (2010). VSMs capture word relationships by encoding words as points in a high-dimensional space. The models are both flexible and powerful;
[Figure 1: Word–category associations in the EP data. Three panels, awesome (224 tokens), terrible (388 tokens), and cocaine (56 tokens), each plotting P(c|w) for the categories hugs, rocks, teehee, understand, and just wow.]
depending on the application, the vector space can encode syntactic information, as is useful for named entity recognition systems (Turian et al., 2010), or semantic information, as is useful for information retrieval or document classification (Manning et al., 2008). Most VSMs apply some sort of matrix factorization technique to a term-context co-occurrence matrix. However, the success of matrix factorization techniques for word vectors often depends heavily on the choices one makes for weighting the entries (for example, with inverse document frequency of words). Thus, the process of building a VSM requires many design choices, often with only past empirical results as guidance. This challenge is multiplied when building representations for sentiment because we want word vectors to capture both descriptive and emotive meanings. The recently introduced delta inverse document frequency weighting technique has had some success in binary sentiment categorization (Martineau and Finin, 2009), but it does not naturally handle multi-dimensional notions of sentiment.

Our recent work seeks to address these design issues. In Anonymous (2011), we introduce a probabilistic model for learning semantically-sensitive word vectors. In the present paper, we build off of this probabilistic model of documents, because it helps avoid the large design space present in matrix factorization-based VSMs, but we extend its sentiment component considerably. Whereas we previously learned only from unique labels, we are now able to capture the multi-dimensional, non-categorical notion of sentiment that is expressed in the EP data. In the following sections, we introduce the semantic and sentiment components of the model separately, and then describe the procedure for learning the model’s parameters from data.

3.1 Semantic Component

We approximately capture word semantics from a collection of documents by analyzing document-level word co-occurrences. This semantic component uses a probabilistic model of a document as introduced in our previous work (Anonymous, 2011). The model uses a continuous mixture distribution over words indexed by a multi-dimensional random variable θ. Informally, we can think of each dimension of a word vector as a topic in the sense of topic modeling. The document coefficient vector θ thus encodes the strength of each topic for the document. A word’s probability in the document then corresponds to how strongly the word’s topic strengths match those defined by θ.

We assign a probability to a document d using a joint distribution over the document and θ. The model assumes each word w_i ∈ d is conditionally independent of the other words given θ, a bag of words assumption often used when learning from document-level co-occurrences. The probability of a document is thus,

$$ p(d) = \int p(d, \theta)\, d\theta = \int p(\theta) \prod_{i=1}^{N} p(w_i \mid \theta)\, d\theta, \tag{1} $$

where N is the number of words in d and w_i is the
ith word in d.

The conditional distribution p(w_i|θ) is defined using a softmax distribution,

$$ p(w \mid \theta; R, b) = \frac{\exp(\theta^T \phi_w + b_w)}{\sum_{w' \in V} \exp(\theta^T \phi_{w'} + b_{w'})}. \tag{2} $$

The parameters of the model are the word representation matrix R ∈ ℝ^(β × |V|) where each word w (represented as a one-on vector) in the vocabulary V has a β-dimensional vector representation φ_w = Rw corresponding to that word’s column in R. The random variable θ is also a β-dimensional vector, which weights each of the β dimensions of words’ representation vectors. A scalar bias b_w for each word captures differences in overall word frequencies. The probability of a word given the document parameter θ corresponds to how strongly that word’s vector representation φ_w matches the scaling direction of θ.

Equation 1 resembles the probabilistic model of latent Dirichlet allocation (LDA) (Blei et al., 2003), which models documents as mixtures of latent topics. However, our model does not attempt to model individual topics, but instead directly models word probabilities conditioned on the topic mixture variable θ. Our previous work compares the word vectors learned with our semantic component to an approach which uses LDA topic associations as word vectors. We found the word vectors learned with our model to be superior in tasks of document and sentence-level sentiment classification.

Maximum likelihood learning in this model assumes documents d_k in a collection D are i.i.d. samples. The learning problem of finding parameters to maximize the probability of observed documents becomes,

$$ \max_{R, b} \; p(D; R, b) = \prod_{d_k \in D} \int p(\theta) \prod_{i=1}^{N_k} p(w_i \mid \theta; R, b)\, d\theta. \tag{3} $$

Using maximum a posteriori (MAP) estimates for θ, we approximate this learning problem as,

$$ \max_{R, b} \; \prod_{d_k \in D} p(\hat{\theta}_k) \prod_{i=1}^{N_k} p(w_i \mid \hat{\theta}_k; R, b), \tag{4} $$

where θ̂_k denotes the MAP estimate of θ for d_k. Our previous work used a Gaussian prior for θ. In our present experiments we explore both Gaussian and Laplacian priors. The Laplacian prior is intuitively appealing because it encourages sparsity, where certain entries of θ are exactly zero as opposed to small non-zero values as is the case when using Gaussian priors. These exactly zero values correspond to topic dimensions which are not at all present in the semantic representation of a word.

3.2 Sentiment Component

We now introduce the second component of our model, which aims to capture the multi-dimensional sentiment information expressed by words. Unlike topical information, sentiment is not easy to learn by analyzing document-level word co-occurrences alone. For this reason, we use the reaction distributions of documents to capture how words in the document express multi-dimensional sentiment information. Our previous work demonstrated the value of learning sentiment-sensitive word representations for the simplistic binary categorization notion of sentiment. We now introduce a method to learn word vectors sensitive to a continuous multi-dimensional notion of sentiment.

Our model dictates that a word vector φ should predict the reaction distribution of documents in which that word occurs using an appropriate predictor function. Because the reaction distributions are categorical probability distributions, we use a softmax model,

$$ \hat{s}_k = \frac{\exp(\psi_k^T \phi + c_k)}{\sum_{k'} \exp(\psi_{k'}^T \phi + c_{k'})}. \tag{5} $$

The value ŝ_k is the probability predicted for the kth sentiment dimension for a given word vector φ. The softmax weight vectors ψ_k serve to partition the vector space into K regions where each region corresponds to a particular sentiment dimension. The predicted reaction distribution for a word thus depends on where that word lies in the vector space relative to the regions defined by ψ.

For EP data, a document d is associated with its reaction distribution s, which is a five-dimensional categorical probability distribution (K = 5). The softmax parameters ψ ∈ ℝ^(K × β) and c ∈ ℝ^K are
shared across all word vectors so as to create a single set of emotive regions in the word vector space. The softmax predicts a reaction distribution for each word, and we learn the softmax parameters as well as the word vectors to match the observed reaction distributions.

The predicted and actual reaction distributions are categorical probability distributions, so we use the Kullback-Leibler (KL) divergence as a measure of how closely the predicted distribution matches the actual. Given the actual distribution s and a prediction for this distribution ŝ the KL divergence is,

$$ \mathrm{KL}(\hat{s} \,\|\, s) = \sum_{k=1}^{K} s_k \log \frac{s_k}{\hat{s}_k}. \tag{6} $$

Learning this component of the model amounts to finding word vectors as well as softmax parameters to minimize the KL divergence between reaction distributions of observed documents and the predicted reaction distributions of words occurring in the documents. We can formally express this as,

$$ \min_{R, \psi, c} \; \sum_{d_k \in D} \sum_{i=1}^{N_k} \mathrm{KL}(\hat{s}_{w_i} \,\|\, s), \tag{7} $$

where ŝ_{w_i} is the predicted reaction distribution for word w_i as computed by equation (5). To ensure identifiability of the softmax parameters ψ we constrain ψ_K = 0.

3.3 Learning

We now describe the method to learn word vectors using both the semantic and sentiment components of the model. The learning procedure for the semantic component minimizes the negative log of the likelihood shown in equation (4). The sentiment component is then additively combined to form the full learning problem,

$$ \min_{R, b, \psi, c} \; \lambda \|R\|_F^2 + \sum_{d_k \in D} \sum_{i=1}^{N_k} \mathrm{KL}(\hat{s}_{w_i} \,\|\, s) - \sum_{k=1}^{|D|} \left[ \log p(\hat{\theta}_k) + \sum_{i=1}^{N_k} \log p(w_i \mid \hat{\theta}_k; R, b) \right]. \tag{8} $$

We add to the objective Frobenius norm regularization on the word representation matrix R to prevent the word vector norms from growing too large. We minimize the objective function for several iterations using the L-BFGS quasi-Newton algorithm while leaving the MAP estimates θ̂ fixed. The MAP estimates are then updated while leaving the other parameters of the model fixed. This process continues until the objective function value converges.

Our work explores both a Gaussian and a Laplacian prior for θ. The log-Gaussian prior corresponds to a squared ℓ2 (sum of squares, Σ_i x_i²) penalty on θ whereas the Laplacian prior corresponds to an ℓ1 (sum of absolute values, Σ_i |x_i|) penalty. Both priors have a single free parameter λ which is proportional to the variance of the prior distribution. This regularization parameter λ and the word vector dimensionality β are the only free hyper-parameters of the model. Because optimizing the non-differentiable ℓ1 penalty is difficult with gradient-based techniques we approximate the ℓ1 penalty with the function log cosh(θ).

4 Experiments

Our experiments focus on predicting the reaction distribution given the text of a document. We employ several baseline approaches to assess the relative performance of our model. As shown in figure 2, the reaction distributions of stories which received at least five reactions have higher entropy on average than the set which includes stories with only one reaction or more. The higher entropy reaction distributions are of greater interest because predicting such distributions is substantially more challenging than predicting a low entropy distribution, which is more like the categorization approach of previous work. We evaluate models on both the set of texts with at least one reaction, and the set of texts with five or more reactions.

After collecting the text and reaction distributions from the Web, we tokenized all documents with at least one reaction. Traditional stop word removal was not used because certain stop words (e.g. negations) are indicative of sentiment. To minimize the amount of text pre-processing, we did not apply stemming or spelling correction. Because certain non-word tokens (e.g. “!” and “:-)”) are indicative of sentiment, we allow them in our vocabulary. After this tokenization, the dataset consists of 52,973 unique unigrams, many of which occur only once
because they are unique spellings of words (e.g. “hahhhaaa”). The collection of 37,146 documents is reduced to 37,130 when we discard documents with no tokens recognized by our tokenizer. Most stories fall around the median length of 56 words; however, a few are thousands of words long. We randomly partitioned the data into 30,000 training and 7,130 test documents. When we consider documents with at least five reactions, this becomes 5,764 training and 1,307 test documents.

4.1 Word Representation Learning

We induce word representations with our model using the learning procedure described in section 3.3. We construct word representations for only the 5,000 most frequent tokens in the training data. This speeds computation and avoids learning uninformative representations for rare words for which there is insufficient data to properly assess their semantic and sentiment associations. We use the 29,591 documents from our training set with length at least five when the vocabulary is restricted to the 5,000 most frequent tokens. The reaction distributions for documents are used when learning the sentiment component of the model. Our model could leverage additional unlabeled data from related websites to better capture the semantic associations among words. However, we restrict the model to learn from only the labeled training set in order to better compare it to baseline models for this task.

For both the Gaussian and Laplacian models, we evaluate 100-dimensional word vectors and set the regularization parameter λ = 10^-4. Our previous work and preliminary experiments with this dataset suggested the learned word vectors are relatively insensitive to changes in these parameters.

Supplementary diagram A shows a 2-D visualization of the learned word similarities for the 2,000 most frequent words in our vocabulary. The visualization was created using the t-SNE algorithm, with code provided by van der Maaten and Hinton (2008). Word vectors are cosine normalized before passing them to the t-SNE algorithm.

The visualization clearly shows words grouped locally by semantic associations; for example, “doctor” and “medication” are nearby. Additionally, there is some evidence that the macroscopic structure of the words correlates with how they influence reaction distributions. The words in a cluster of playful, upbeat tokens like “:-)” and “haha” are all likely to appear in stories which elicit the rocks or teehee reactions. Far removed from such happy words are clusters of words indicative of melancholic subjects, marked by words like “cancer” and “suicide.” We note that sad and troubling topics are highly prevalent in the data, and our visualization reflects this fact.

After learning the word representations, we represent documents using average word vectors. This approach uses the arithmetic average of the word vectors for all words which appear in the document. Because we learn word vectors for only the 5,000 most frequent words, a small fraction of the documents contain only words for which we do not have vector representations. These documents are represented as a vector of all zeros.

4.2 Alternative Methods

In addition to the vectors induced using our model, we evaluate the performance of several standard approaches to document categorization and information retrieval.

Unigram Bag of Words. Representing a document as a vector of word counts performs surprisingly well in many classification tasks. In our preliminary experiments, we found that term presence performs better than term frequency on EP data, as noted in previous work on sentiment (Pang et al., 2002). We also note that delta inverse document frequency weighting, which has been shown to sometimes perform well in sentiment (Martineau and Finin, 2009), does not extend easily to multi-dimensional notions of sentiment. We thus use term presence vectors with no normalization and evaluate with the full vocabulary of the dataset and the 5,000 word vocabulary used in building word vectors.

Latent Semantic Analysis (LSA). We apply truncated singular value decomposition to a term-document count matrix to obtain word vectors from LSA (Deerwester et al., 1990). We first apply tf.idf weighting to the term-document matrix, but do not use cosine normalization. We use the same 5,000 word vocabulary as is used when constructing word vectors for our model.
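The feature sets described in sections 4.1 and 4.2 can be illustrated with a small NumPy sketch. It is only a sketch under stated assumptions, not the authors' implementation: the function names, the toy documents, the idf variant (log of N over document frequency), the scaling of the LSA word vectors by the singular values, and averaging over tokens rather than word types are all assumptions made for illustration.

```python
import numpy as np

def term_presence(docs, vocab):
    """Binary term-presence matrix (documents x vocabulary), no normalization."""
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, tokens in enumerate(docs):
        for tok in tokens:
            if tok in index:
                X[i, index[tok]] = 1.0
    return X

def lsa_word_vectors(docs, vocab, dim):
    """tf.idf-weighted term-document counts followed by truncated SVD."""
    index = {w: j for j, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))       # terms x documents
    for i, tokens in enumerate(docs):
        for tok in tokens:
            if tok in index:
                counts[index[tok], i] += 1.0
    df = np.count_nonzero(counts, axis=1)            # document frequency per term
    idf = np.log(len(docs) / np.maximum(df, 1))      # assumed idf variant
    weighted = counts * idf[:, None]                 # no cosine normalization
    U, S, _ = np.linalg.svd(weighted, full_matrices=False)
    return (U[:, :dim] * S[:dim]).T                  # dim x |V|, one column per word

def average_word_vectors(docs, vocab, R):
    """Mean of the columns of R over in-vocabulary tokens; zeros if none occur."""
    index = {w: j for j, w in enumerate(vocab)}
    out = np.zeros((len(docs), R.shape[0]))
    for i, tokens in enumerate(docs):
        cols = [index[t] for t in tokens if t in index]
        if cols:
            out[i] = R[:, cols].mean(axis=1)
    return out

# Toy data (hypothetical); the last document has no in-vocabulary tokens.
docs = [["i", "am", "so", "happy", ":-)"], ["terrible", "terrible", "day"], ["zzz"]]
vocab = ["happy", "terrible", "day", ":-)"]

presence = term_presence(docs, vocab)
R = lsa_word_vectors(docs, vocab, dim=2)
doc_features = average_word_vectors(docs, vocab, R)
```

The same `average_word_vectors` helper applies unchanged to word vectors learned by any other method, as long as they are stored as the columns of a matrix aligned with `vocab`.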
Table 3: Test set results. KL is the average KL divergence (lower is better); Max Acc. is the accuracy (%) in predicting the maximum probability reaction.

                                    ≥ 5 reactions       ≥ 1 reaction
Features                            KL      Max Acc.    KL      Max Acc.
Uniform Reactions                   0.861   20.2        1.275   20.4
Mean Training Reactions             0.763   43.0        1.133   46.7
Bag of Words (All unigrams)         0.637   56.0        1.000   53.4
Bag of Words (Top 5000 unigrams)    0.640   54.9        0.992   54.3
LSA                                 0.667   51.8        1.032   52.2
Our Method Laplacian Prior          0.621   55.7        0.991   54.7
Our Method Gaussian Prior           0.620   55.2        0.991   54.6
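As a rough illustration of the two metrics reported in the table, the sketch below computes the average KL divergence, using the sum from equation (6), and the max-probability accuracy, where a prediction counts as correct if it matches any tied maximum of the true distribution. The function names, the `eps` clamp that guards against zero predicted probabilities, and the toy distributions are assumptions made for illustration.

```python
import numpy as np

def kl_divergence(s, s_hat, eps=1e-12):
    """Sum over k of s_k * log(s_k / s_hat_k), treating 0 * log(0) as 0."""
    s = np.asarray(s, dtype=float)
    s_hat = np.asarray(s_hat, dtype=float)
    mask = s > 0                                   # terms with s_k = 0 contribute nothing
    return float(np.sum(s[mask] * np.log(s[mask] / np.maximum(s_hat[mask], eps))))

def max_reaction_correct(s, s_hat):
    """True if the predicted top category is a (possibly tied) top category of s."""
    s = np.asarray(s, dtype=float)
    predicted = int(np.argmax(s_hat))
    return bool(np.isclose(s[predicted], s.max()))

def evaluate(S, S_hat):
    """Return (average KL divergence, max-probability accuracy in percent)."""
    kls = [kl_divergence(s, sh) for s, sh in zip(S, S_hat)]
    hits = [max_reaction_correct(s, sh) for s, sh in zip(S, S_hat)]
    return float(np.mean(kls)), 100.0 * float(np.mean(hits))

# Toy true and predicted reaction distributions over K = 5 categories.
S = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0, 0.0],
              [0.2, 0.2, 0.2, 0.2, 0.2]])
S_hat = np.array([[0.2, 0.2, 0.2, 0.2, 0.2],
                  [0.1, 0.6, 0.1, 0.1, 0.1],
                  [0.2, 0.2, 0.2, 0.2, 0.2]])

avg_kl, max_acc = evaluate(S, S_hat)
```

A uniform prediction against a single-reaction (zero entropy) document gives a KL divergence of log 5, which shows why such documents pull the average KL upward.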
4.3 High Entropy Reaction Distributions

Our first experiment considers only the examples with at least five reaction clicks, because they best exhibit the blended distributional notion of sentiment of interest in this work. For all of the feature sets described (mean word vectors and bag of words), we train a softmax classifier on the training set. The softmax classifier is a predictor of the same form as is described in equation (5), but with a quadratic regularization penalty on the weights. The strength of the regularization penalty is set by cross-validation on the training set. The classifier is trained to minimize the KL divergence of predicted and actual distributions on the training set. We then evaluate the models by measuring average KL divergence on the test set.

We also report performance of models in terms of accuracy in predicting the maximum probability reaction for a document. In this setting, the model picks a single category corresponding to its most probable predicted reaction. A prediction is counted as correct if that category is the most probable in the true reaction distribution, or if it is tied with other categories for the role of most probable. None of the models were explicitly optimized to perform this task, but instead to predict the full distribution of reactions. However, it is helpful to compare this performance metric to KL divergence, as measuring performance in terms of accuracy is more familiar. Table 3 shows the results; recall that lower average KL divergence indicates better performance.

All bag of words and vector space models beat the simplistic baselines of predicting the average reaction distribution, or a uniform distribution. The improvements in both KL divergence and accuracy are substantial relative to these simplistic baselines, suggesting that it is indeed feasible to predict reaction distributions from text. Both variants of our model perform better than bag of words and LSA in KL divergence, but bag of words performs best using the accuracy as the metric. That the accuracy and KL metrics disagree on models’ performance rankings suggests categorization accuracy is not a sufficient indicator of how well models capture a distributional notion of sentiment. Based on the poor performance of LSA-derived word vectors, we hypothesize that learning representations using sentiment distributions is critical when attempting to capture the blended sentiment information within documents.

Differences in KL divergence are somewhat difficult to interpret, so we use a matched t-test to evaluate their significance. The matched t-test between two models takes the KL divergence for each test example and evaluates the hypothesis that the KL divergence numbers come from the same distribution. KL divergences on the set of test examples are approximately gamma distributed with a valid range of [0, ∞). We thus apply the matched t-test to the logarithm of the KL divergences, which have a Gaussian distribution as assumed by the t-test. We find that the differences in KL divergence between our models and the bag of words models are significant (p < 0.001). However, the Gaussian and Laplacian prior variants of our model do not differ significantly from each other. The prior over document coefficients perhaps has little effect relative to the other components of our model, causing both
model variants to perform comparably.

4.4 All Reaction Distributions

We repeated the experimental procedure using the full dataset which includes all documents with at least one reaction. As noted in figure 2, these reaction distributions have low average entropy because a large number of documents have only a few reactions. Distribution predictors for all models were trained and evaluated on this dataset; table 3 shows the results.

Again all models outperform the naive baselines of guessing the average training distribution or a uniform distribution. A third baseline (not shown) which assigns 99% of its probability mass to the dominant understand category performs substantially worse than all results shown. Although the differences in KL divergence between our models and the bag of words baselines are numerically small, the improvement of our models is significant as measured by the matched t-test (p < 0.001). The significance of such small differences is due to the large testing set size. Again the Gaussian and Laplacian variants of our model do not differ significantly from each other in performance.

We see that all models have a higher average KL divergence on this task as compared to evaluation on the set of documents with at least five reactions. As shown in table 2, reaction distributions with zero entropy dominate this version of the dataset. We hypothesize that the higher average KL divergences and small numerical differences in KL divergence are largely due to all predictors struggling to fit these zero entropy distributions which were formed with only one reaction click.

5 Conclusion

Using the confessions at the EP, we showed that natural language texts often convey a wide range of sentiment information to varying degrees. While classification models can capture certain emotive dimensions, they miss this blended, continuous nature of sentiment expression. Building on the existing classifier model of Anonymous (2011), we developed a vector-space model that learns from distributions over emotive categories, in addition to capturing basic semantic information in an unsupervised fashion. The model is successful in absolute terms, suggesting that learning realistic sentiment distributions is tractable, and it also outperforms various baselines, including LSA. We believe the task of predicting sentiment distributions from text provides a rich challenge for the field of sentiment analysis, especially when compared to simpler classification tasks. Going forward, we plan to move beyond the lexical level to capture the ways in which sentiment is influenced by compositional semantic facts (e.g., interaction with negation and other non-veridical operators), which we expect to provide further insights into the complexities of sentiment expression.

Acknowledgments

This work is supported by the DARPA Deep Learning program under contract number FA8650-10-C-7020, an NSF Graduate Fellowship awarded to AM, ONR grant No. N00014-10-1-0109, and ARO grant No. W911NF-07-1-0216.

References

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(4-5):993–1022, May.

Rebecca F. Bruce and Janyce M. Wiebe. 1999. Recognizing subjectivity: A case study in manual tagging. Natural Language Engineering, 5(2).

Luís Cabral and Ali Hortaçsu. 2006. The dynamics of seller reputation: Theory and evidence from eBay. Working paper, downloaded version revised in March.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, September.

Paul Ekman. 1992. An argument for basic emotions. Cognition and Emotion, 6(3/4):169–200.

Andrew B. Goldberg and Jerry Zhu. 2006. Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization. In TextGraphs: HLT/NAACL Workshop on Graph-based Algorithms for Natural Language Processing.
Vasileios Hatzivassiloglou and Janyce Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the International Conference on Computational Linguistics (COLING).

Hugo Liu, Henry Lieberman, and Ted Selker. 2003. A model of textual affect sensing using real-world knowledge. In Proceedings of Intelligent User Interfaces (IUI), pages 125–132.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June. Association for Computational Linguistics.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, 1 edition.

J. Martineau and T. Finin. 2009. Delta tfidf: An improved feature space for sentiment analysis. In Proceedings of the third AAAI international conference on weblogs and social media.

Alena Neviarouskaya, Helmut Prendinger, and Mitsuru Ishizuka. 2010. Recognition of affect, judgment, and appreciation in text. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 806–814, Beijing, China, August. COLING 2010 Organizing Committee.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Association for Computational Linguistics (ACL), pages 271–278.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1):1–135.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, Philadelphia, July. Association for Computational Linguistics.

Christopher Potts. 2010. On the negativity of negation. In David Lutz and Nan Li, editors, Proceedings of Semantics and Linguistic Theory 20. CLC Publications, Ithaca, NY.

Ellen Riloff and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ellen Riloff, Janyce Wiebe, and William Phillips. 2005. Exploiting subjectivity classification to improve information extraction. In Proceedings of AAAI, pages 1106–1111.

James A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178.

Benjamin Snyder and Regina Barzilay. 2007. Multiple aspect ranking using the Good Grief algorithm. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 300–307.

Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of EMNLP, pages 327–335.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the ACL.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Peter Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), pages 417–424.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November.

Janyce M. Wiebe, Rebecca F. Bruce, and Thomas P. O’Hara. 1999. Development and use of a gold standard data set for subjectivity classifications. In Proceedings of the Association for Computational Linguistics (ACL), pages 246–253.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation (formerly Computers and the Humanities), 39(2/3):164–210.

Theresa Wilson, Janyce Wiebe, and Rebecca Hwa. 2006. Just how mad are you? Finding strong and weak opinion clauses. Computational Intelligence, 2(22):73–99.