
Multi-Dimensional Sentiment Analysis with Learned Representations

Andrew L. Maas, Andrew Y. Ng, and Christopher Potts


Stanford University
Stanford, CA 94305
[amaas, ang, cgpotts]@stanford.edu

Abstract

Treating sentiment analysis as a classification problem has proven extremely useful, but it misses the blended, continuous nature of sentiment expression in natural language. Using data from the Experience Project, we study texts as distributions over sentiment categories. Analysis of the document collection shows the texts contain blended sentiment information substantially different from a categorization view of sentiment. We introduce a statistical vector-space model that learns from distributions over emotive categories, in addition to capturing basic semantic information in an unsupervised fashion. Our model outperforms several baselines in predicting sentiment distributions given only the text of a document.

1 Introduction

Computational sentiment analysis is often reduced to a classification task: each text is presumed to have a unique label summarizing its overall sentiment, and the goal is to build models that accurately predict those labels (Turney, 2002; Pang et al., 2002; Pang and Lee, 2008). The most widely-used labels are ‘positive’ and ‘negative’, with a third ‘neutral’ category also commonly included (Cabral and Hortaçsu, 2006). Sometimes this basic approach is enriched to a ranked or partially-ranked set of categories — for example, star ratings of the sort that are extremely common on the Web (Pang and Lee, 2005; Goldberg and Zhu, 2006; Snyder and Barzilay, 2007). And there is a large body of work employing other categories: not only binary distinctions like subjective vs. objective (Bruce and Wiebe, 1999; Wiebe et al., 1999; Hatzivassiloglou and Wiebe, 2000; Riloff and Wiebe, 2003; Riloff et al., 2005; Pang and Lee, 2004) and pro vs. con (Thomas et al., 2006), but also rich multidimensional category sets modeled on those of cognitive psychology (Liu et al., 2003; Alm et al., 2005; Wiebe et al., 2005; Neviarouskaya et al., 2010).

While treating sentiment as a classification problem is extremely useful for a wide range of tasks, it is just an approximation of the sentiment information that can be conveyed linguistically. The central assumption of the classification approach is that each text is uniquely labeled by one of the categories. However, human reactions are often nuanced, blended, and continuous (Russell, 1980; Ekman, 1992; Wilson et al., 2006). Consider, for example, this short ‘confession’ text from the website ExperienceProject.com:

    I have a crush on my boss! *blush* eeek *back to work*

At the Experience Project, users can react to texts by clicking buttons summarizing a range of emotions: ‘sorry, hugs’, ‘that rocks’, ‘tee-hee’, ‘I understand’, and ‘wow, just wow’. At the time of this writing, the above confession had received the following distribution of reactions: ‘that rocks’: 1, ‘tee-hee’: 1, ‘I understand’: 10, and ‘wow, just wow’: 0. This corresponds well to the mix of human responses we might expect this text to elicit: it describes a socially awkward and complex situation, which provokes sympathetic reactions, but the text is lighthearted in tone and thus likely to elicit less weighty reactions as well. The comments on the confession reflect the summary offered by the reaction distribution: some users tease (“Oooooooooo. . . . i’m tellin!!! lol”) and others offer encouragement (“you go and get that man. . . ”).

In this paper, we develop an approach that allows us to embrace the blended, continuous nature of human sentiment judgments. Our primary data are about 37,000 confessions from the Experience Project with associated reaction distributions. We focus on predicting those reaction distributions given only the confession text. This problem is substantially more challenging than simple classification, but we show that it is tractable and that it presents a worthwhile set of new questions for research in linguistics, natural language processing, and machine learning.

At the heart of our approach is a model that learns vector representations of words. The model has both supervised and unsupervised components. The unsupervised component captures basic semantic information distributionally. However, this document-level distributional information misses important sentiment content. We thus rely on our labeled data to imbue the word vectors with rich emotive information.

Visualization of our model’s learned word representations shows multiple levels of word similarity (supplementary diagram A). At the macroscopic level, words are grouped into large clusters based on the reaction distributions they are likely to elicit, reflecting their sentiment connotations. Within these macroscopic clusters, words with highly related descriptive semantic content form sub-structures.

We evaluate our model based upon how well it predicts the reaction distributions of stories, but we also report categorization accuracy as a point of reference. To assess the impact of learning representations specifically for sentiment, we compare our model with several alternative techniques and find it performs significantly better in experiments on the Experience Project data.

2 Data

As noted above, our data come from the website ExperienceProject.com (EP). The site allows users to upload a variety of different kinds of texts, to comment on others’ texts, and to contribute to annotating the texts with information about their reactions. We focus on the ‘confessions’, which are typically short, informal texts relating personal stories, attitudes, and emotions. Here are two typical confessions with their associated reactions:

    I really hate being shy . . . I just want to be able to talk to someone about anything and everything and be myself. . . That’s all I’ve ever wanted. [understand: 10; hugs: 1; just wow: 0; rock: 1; teehee: 2]

    subconsciously, I constantly narrate my own life in my head. in third person. in a british accent. Insane? Probably [understand: 0; hugs: 0; just wow: 1; rock: 7; teehee: 8]

Our data consist of 37,146 texts (3,564,039 words; median text length of 56 words). Table 1 provides some basic information about the overall distribution of reactions. They are highly skewed towards the category ‘I understand’; the stories are confessional, so it is natural for readers to be sympathetic in response. The ‘wow, just wow’ category is correspondingly little used, in virtue of the fact that it is largely for negative exclamation (its associated emoticon has its mouth and eyes wide open). Such reactions are reserved largely for extremely transgressive or shocking information.

Category          Clicks
‘sorry, hugs’     22,236 (19%)
‘you rock’        25,416 (22%)
‘teehee’          16,052 (14%)
‘I understand’    42,352 (37%)
‘wow, just wow’    9,745 (8%)

Table 1: Overall distribution of reactions.

We have restricted attention to the texts with at least one reaction. Table 2 summarizes the amount of reaction data present in this document collection, by measuring cut-offs at various salient points. When analyzing the reaction data, we normalize the counts such that the distribution over reactions sums to 1. This allows us to treat the reaction data as a probability distribution, ignoring differences in the raw number of counts stories receive.
Reactions   Texts
≥ 1         37,146
≥ 2         24,179
≥ 3         15,813
≥ 4         10,537
≥ 5          7,073

Table 2: Reaction counts (texts with at least the given number of reactions).
There are many intuitive correlations between the authors’ word choices and readers’ reaction responses (Potts, 2010). Figure 1 illustrates this effect with words that show strong affinities to particular reaction types. Each panel depicts the distribution of the word across the rating categories. These were derived by first estimating P(w|c), the probability of word w given class c, and then obtaining P(c|w) by an application of Bayes rule under the assumption of a uniform prior over the classes. (Without this uniformity assumption, almost all words appear to associate with the ‘understand’ category, which is about four times bigger than the others; see table 1.) The gray horizontal line is at 0.20, the expected probability if there is no association between the word’s usage and the reaction categories.
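For concreteness, this derivation is only a few lines of code. The sketch below (illustrative only, with hypothetical counts and variable names; not the code used for the paper) computes P(c|w) for a toy two-word vocabulary under the uniform class prior:

```python
import numpy as np

# counts[i, j] = occurrences of word i in texts labeled with reaction category j.
# Rows: two hypothetical words; columns: hugs, rocks, teehee, understand, just wow.
counts = np.array([[ 5., 40., 35.,  90.,  4.],
                   [70., 10.,  8., 120.,  6.]])

# P(w|c): normalize each category column into a distribution over words.
p_w_given_c = counts / counts.sum(axis=0, keepdims=True)

# Bayes rule with a uniform prior over classes: P(c|w) is proportional to P(w|c).
p_c_given_w = p_w_given_c / p_w_given_c.sum(axis=1, keepdims=True)

print(p_c_given_w)  # each row sums to 1; values near 0.20 indicate no association
```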
[Figure 1: Word–category associations in the EP data. Each panel plots P(c|w) over the categories ‘hugs’, ‘rocks’, ‘teehee’, ‘understand’, and ‘just wow’ for one word: awesome (224 tokens), terrible (388 tokens), and cocaine (56 tokens).]

The first panel in figure 1 depicts awesome. As one might expect, this correlates most strongly with the ‘rocks’ and ‘teehee’ categories; stories in which one uses this word are likely to be perceived as positive and light-hearted (especially as compared to the usual EP fare). Conversely, terrible, in the second plot, correlates with ‘hugs’ and ‘understand’; when an author describes something as terrible, readers react with sympathy and solidarity. The final panel depicts cocaine, one of a handful of words in the corpus that generate predominantly ‘wow, just wow’ reactions. We hope these examples help convey the nature of the reaction categories and also suggest that it is promising to try to use these data to learn sentiment-rich word vectors.

Finally, we address the question of how much the distributions matter as compared with a categorical view of sentiment. If the majority of the texts in the data received categorical or near-categorical responses, we might conclude that classification is an appropriate modeling choice. Conversely, if the texts tend to receive mixed reactions, then we are justified in adopting our more complex approach. Figure 2 assesses this using the entropy of the reaction distributions. Where the entropy is zero, just one category was chosen. Where the entropy is around two, the reactions were evenly distributed across the categories. As is evident from this plot, the overall picture is far from categorical; about one-third of the texts have a non-negligible amount of variation in their distributions. What’s more, this picture is somewhat misleading. As table 2 shows, the majority of our texts have just one reaction. If we restrict attention to the 7,073 texts with at least five reactions, then the entropy values are more evenly distributed, with an entropy of zero far less dominant, as in figure 2(b). Thus, these texts manifest the blended, continuous nature of sentiment that we wish to model.

[Figure 2: The entropy of the reaction distributions. Histograms of entropy (0.0 to 2.0) against number of texts: (a) the full corpus; (b) texts with ≥ 5 reactions.]
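The entropy statistic behind figure 2 is computed directly from the normalized reaction counts. A minimal sketch (assuming base-2 logarithms, under which an even spread over the five categories gives log2 5 ≈ 2.32 bits):

```python
import numpy as np

def reaction_entropy(clicks):
    """Entropy (base 2) of a reaction-count vector, after normalizing to a distribution."""
    p = np.asarray(clicks, dtype=float)
    p = p / p.sum()        # normalize counts so they sum to 1
    p = p[p > 0]           # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(reaction_entropy([10, 1, 0, 1, 2]))  # the "shy" confession above: about 1.29 bits
print(reaction_entropy([0, 0, 4, 0, 0]))   # a single-category response: 0.0
```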
3 Model

We introduce a model that captures semantic associations among words as well as the blended distributional sentiment information conveyed by words. We assume each word is represented by a real-valued vector and use a probabilistic model to learn the words’ vector representations from data. The learning procedure uses the unsupervised information of document-level word co-occurrences as well as the reaction distributions present in the EP data.

Our work fits into the broad class of vector space models (VSMs), recently reviewed by Turney and Pantel (2010). VSMs capture word relationships by encoding words as points in a high-dimensional space. The models are both flexible and powerful; depending on the application, the vector space can encode syntactic information, as is useful for named entity recognition systems (Turian et al., 2010), or semantic information, as is useful for information retrieval or document classification (Manning et al., 2008). Most VSMs apply some sort of matrix factorization technique to a term-context co-occurrence matrix. However, the success of matrix factorization techniques for word vectors often depends heavily on the choices one makes for weighting the entries (for example, with inverse document frequency of words). Thus, the process of building a VSM requires many design choices, often with only past empirical results as guidance. This challenge is multiplied when building representations for sentiment because we want word vectors to capture both descriptive and emotive meanings. The recently introduced delta inverse document frequency weighting technique has had some success in binary sentiment categorization (Martineau and Finin, 2009), but it does not naturally handle multi-dimensional notions of sentiment.

Our recent work seeks to address these design issues. In Anonymous (2011), we introduce a probabilistic model for learning semantically-sensitive word vectors. In the present paper, we build off of this probabilistic model of documents, because it helps avoid the large design space present in matrix factorization-based VSMs, but we extend its sentiment component considerably. Whereas we previously learned only from unique labels, we are now able to capture the multi-dimensional, non-categorical notion of sentiment that is expressed in the EP data. In the following sections, we introduce the semantic and sentiment components of the model separately, and then describe the procedure for learning the model’s parameters from data.

3.1 Semantic Component

We approximately capture word semantics from a collection of documents by analyzing document-level word co-occurrences. This semantic component uses a probabilistic model of a document as introduced in our previous work (Anonymous, 2011). The model uses a continuous mixture distribution over words indexed by a multi-dimensional random variable θ. Informally, we can think of each dimension of a word vector as a topic in the sense of topic modeling. The document coefficient vector θ thus encodes the strength of each topic for the document. A word’s probability in the document then corresponds to how strongly the word’s topic strengths match those defined by θ.

We assign a probability to a document d using a joint distribution over the document and θ. The model assumes each word w_i ∈ d is conditionally independent of the other words given θ, a bag of words assumption often used when learning from document-level co-occurrences. The probability of a document is thus

    p(d) = \int p(d, \theta) \, d\theta = \int p(\theta) \prod_{i=1}^{N} p(w_i | \theta) \, d\theta,    (1)

where N is the number of words in d and w_i is the ith word in d.
The conditional distribution p(w_i | θ) is defined using a softmax distribution,

    p(w | \theta; R, b) = \frac{\exp(\theta^T \phi_w + b_w)}{\sum_{w' \in V} \exp(\theta^T \phi_{w'} + b_{w'})}.    (2)

The parameters of the model are the word representation matrix R \in \mathbb{R}^{\beta \times |V|}, where each word w (represented as a one-hot vector) in the vocabulary V has a β-dimensional vector representation φ_w = R_w corresponding to that word’s column in R. The random variable θ is also a β-dimensional vector, which weights each of the β dimensions of words’ representation vectors. A scalar bias b_w for each word captures differences in overall word frequencies. The probability of a word given the document parameter θ corresponds to how strongly that word’s vector representation φ_w matches the scaling direction of θ.
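Equation (2) is an ordinary softmax over the vocabulary and can be computed in a vectorized way. The sketch below (illustrative shapes and random parameters, not the trained model) evaluates p(w | θ; R, b) for every word at once:

```python
import numpy as np

beta, vocab_size = 100, 5000                         # word vector dimensionality and |V|
rng = np.random.default_rng(0)
R = 0.01 * rng.standard_normal((beta, vocab_size))   # word representation matrix R
b = np.zeros(vocab_size)                             # per-word frequency biases b_w

def p_w_given_theta(theta):
    """Equation (2): softmax distribution over the vocabulary given document vector theta."""
    logits = theta @ R + b        # theta^T phi_w + b_w for all words simultaneously
    logits -= logits.max()        # shift by the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

theta = rng.standard_normal(beta)   # a document's topic-strength vector
probs = p_w_given_theta(theta)
print(probs.shape, probs.sum())     # (5000,) and 1.0
```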
Equation 1 resembles the probabilistic model of latent Dirichlet allocation (LDA) (Blei et al., 2003), which models documents as mixtures of latent topics. However, our model does not attempt to model individual topics, but instead directly models word probabilities conditioned on the topic mixture variable θ. Our previous work compares the word vectors learned with our semantic component to an approach which uses LDA topic associations as word vectors. We found the word vectors learned with our model to be superior in tasks of document- and sentence-level sentiment classification.

Maximum likelihood learning in this model assumes documents d_k in a collection D are i.i.d. samples. The learning problem of finding parameters to maximize the probability of observed documents becomes

    \max_{R,b} p(D; R, b) = \prod_{d_k \in D} \int p(\theta) \prod_{i=1}^{N_k} p(w_i | \theta; R, b) \, d\theta.    (3)

Using maximum a posteriori (MAP) estimates for θ, we approximate this learning problem as

    \max_{R,b} \prod_{d_k \in D} p(\hat{\theta}_k) \prod_{i=1}^{N_k} p(w_i | \hat{\theta}_k; R, b),    (4)

where \hat{\theta}_k denotes the MAP estimate of θ for d_k. Our previous work used a Gaussian prior for θ. In our present experiments we explore both Gaussian and Laplacian priors. The Laplacian prior is intuitively appealing because it encourages sparsity, where certain entries of θ are exactly zero, as opposed to the small non-zero values that arise when using Gaussian priors. These exactly zero values correspond to topic dimensions which are not at all present in the semantic representation of a word.

3.2 Sentiment Component

We now introduce the second component of our model, which aims to capture the multi-dimensional sentiment information expressed by words. Unlike topical information, sentiment is not easy to learn by analyzing document-level word co-occurrences alone. For this reason, we use the reaction distributions of documents to capture how words in the document express multi-dimensional sentiment information. Our previous work demonstrated the value of learning sentiment-sensitive word representations for the simplistic binary categorization notion of sentiment. We now introduce a method to learn word vectors sensitive to a continuous multi-dimensional notion of sentiment.

Our model dictates that a word vector φ should predict the reaction distribution of documents in which that word occurs using an appropriate predictor function. Because the reaction distributions are categorical probability distributions, we use a softmax model,

    \hat{s}_k = \frac{\exp(\psi_k^T \phi + c_k)}{\sum_{k'} \exp(\psi_{k'}^T \phi + c_{k'})}.    (5)

The value \hat{s}_k is the probability predicted for the kth sentiment dimension for a given word vector φ. The softmax weight vectors ψ_k serve to partition the vector space into K regions, where each region corresponds to a particular sentiment dimension. The predicted reaction distribution for a word thus depends on where that word lies in the vector space relative to the regions defined by ψ.

For EP data, a document d is associated with its reaction distribution s, which is a five-dimensional categorical probability distribution (K = 5). The softmax parameters ψ \in \mathbb{R}^{K \times \beta} and c \in \mathbb{R}^{K} are shared across all word vectors so as to create a single set of emotive regions in the word vector space. The softmax predicts a reaction distribution for each word, and we learn the softmax parameters as well as the word vectors to match the observed reaction distributions.
The predicted and actual reaction distributions are categorical probability distributions, so we use the Kullback-Leibler (KL) divergence as a measure of how closely the predicted distribution matches the actual. Given the actual distribution s and a prediction for this distribution \hat{s}, the KL divergence is

    KL(\hat{s} \| s) = \sum_{k=1}^{K} s_k \log \frac{s_k}{\hat{s}_k}.    (6)

Learning this component of the model amounts to finding word vectors as well as softmax parameters to minimize the KL divergence between reaction distributions of observed documents and the predicted reaction distributions of words occurring in the documents. We can formally express this as

    \min_{R,\psi,c} \sum_{d_k \in D} \sum_{i=1}^{N_k} KL(\hat{s}_{w_i} \| s),    (7)

where \hat{s}_{w_i} is the predicted reaction distribution for word w_i as computed by (5). To ensure identifiability of the softmax parameters ψ, we constrain ψ_K = 0.
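One term of the objective in (7) is easy to state in code. The following sketch (random stand-in parameters; the ψ_K = 0 constraint is omitted for brevity) evaluates the predictor of equation (5) and the divergence of equation (6), as written above, for a single word:

```python
import numpy as np

K, beta = 5, 100                        # five reaction categories

def predict_reactions(phi, psi, c):
    """Equation (5): softmax over the K sentiment regions for word vector phi."""
    logits = psi @ phi + c              # psi is K x beta, c has length K
    logits -= logits.max()
    e = np.exp(logits)
    return e / e.sum()

def kl_divergence(s, s_hat, eps=1e-12):
    """Equation (6) as written: sum_k s_k * log(s_k / s_hat_k), with 0 log 0 = 0."""
    s = np.asarray(s, dtype=float)
    return float(np.sum(s * np.log((s + eps) / (s_hat + eps))))

rng = np.random.default_rng(0)
psi, c = rng.standard_normal((K, beta)), np.zeros(K)
phi = rng.standard_normal(beta)          # one word's vector
s = np.array([1, 0, 0, 10, 0]) / 11.0    # an observed, normalized reaction distribution
print(kl_divergence(s, predict_reactions(phi, psi, c)))  # one summand of equation (7)
```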
one reaction or more. The higher entropy reaction
3.3 Learning distributions are of greater interest because predict-
We now describe the method to learn word vectors ing such distributions is substantially more challeng-
using both the semantic and sentiment components ing than predicting a low entropy distribution, which
of the model. The learning procedure for the se- is more like the categorization approach of previous
mantic component minimizes the negative log of the work. We evaluate models on both the set of texts
likelihood shown in equation (4). The sentiment with at least one reaction, and the set of texts with
component is then additively combined to form the five or more reactions.
full learning problem, After collecting the text and reaction distributions
from the Web, we tokenized all documents with at
Nk
X X least one reaction. Traditional stop word removal
min λ||R||2F + KL(ŝwi ||s)
R,b,ψ,c was not used because certain stop words (e.g. nega-
dk ∈D i=1
  tions) are indicative of sentiment. To minimize the
|D| Nk
X X amount of text pre-processing, we did not apply
− log p(θ̂k ) + log p(wi |θ̂k ; R, b) . (8) stemming or spelling correction. Because certain
k=1 i=1
non-word tokens (e.g. “!” and “:-)” ) are indicative
We add to the objective Frobenious norm regular- of sentiment, we allow them in our vocabulary. Af-
ization on the word representation matrix R to pre- ter this tokenization, the dataset consists of 52,973
vent the word vector norms from growing too large. unique unigrams, many of which occur only once
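The log cosh surrogate and one half-step of the alternating scheme can be illustrated with scipy’s L-BFGS implementation. The sketch below (a toy quadratic data term standing in for the actual negative log likelihood; not the paper’s implementation) updates one document’s MAP estimate with all other parameters held fixed; the other half-step would optimize R, b, ψ, and c with \hat{\theta} fixed, and the two steps alternate until convergence:

```python
import numpy as np
from scipy.optimize import minimize

def log_cosh_l1(theta):
    """Smooth surrogate for the l1 penalty: log cosh(x) ~ |x| away from 0, differentiable at 0."""
    return np.sum(np.log(np.cosh(theta)))

target = np.array([1.0, 0.0, 0.0, -2.0])   # toy stand-in for the data-fitting term's optimum
lam = 0.5                                  # prior strength

def neg_log_posterior(theta):
    # Quadratic stand-in for the likelihood term plus the Laplacian prior's smooth penalty.
    return 0.5 * np.sum((theta - target) ** 2) + lam * log_cosh_l1(theta)

theta_hat = minimize(neg_log_posterior, np.zeros(4), method="L-BFGS-B").x
print(theta_hat)   # entries are shrunk toward zero by the sparsity-encouraging penalty
```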
4 Experiments

Our experiments focus on predicting the reaction distribution given the text of a document. We employ several baseline approaches to assess the relative performance of our model. As shown in figure 2, the reaction distributions of stories which received at least five reactions have higher entropy on average than the set which includes stories with only one reaction or more. The higher entropy reaction distributions are of greater interest because predicting such distributions is substantially more challenging than predicting a low entropy distribution, which is more like the categorization approach of previous work. We evaluate models on both the set of texts with at least one reaction and the set of texts with five or more reactions.

After collecting the text and reaction distributions from the Web, we tokenized all documents with at least one reaction. Traditional stop word removal was not used because certain stop words (e.g. negations) are indicative of sentiment. To minimize the amount of text pre-processing, we did not apply stemming or spelling correction. Because certain non-word tokens (e.g. “!” and “:-)”) are indicative of sentiment, we allow them in our vocabulary. After this tokenization, the dataset consists of 52,973 unique unigrams, many of which occur only once because they are unique spellings of words (e.g. “hahhhaaa”). The collection of 37,146 documents is reduced to 37,130 when we discard documents with no tokens recognized by our tokenizer. Most stories fall around the median length of 56 words; however, a few are thousands of words long. We randomly partitioned the data into 30,000 training and 7,130 test documents. When we consider documents with at least five reactions, this becomes 5,764 training and 1,307 test documents.
4.1 Word Representation Learning

We induce word representations with our model using the learning procedure described in section 3.3. We construct word representations for only the 5,000 most frequent tokens in the training data. This speeds computation and avoids learning uninformative representations for rare words for which there is insufficient data to properly assess their semantic and sentiment associations. We use the 29,591 documents from our training set with length at least five when the vocabulary is restricted to the 5,000 most frequent tokens. The reaction distributions for documents are used when learning the sentiment component of the model. Our model could leverage additional unlabeled data from related websites to better capture the semantic associations among words. However, we restrict the model to learn from only the labeled training set in order to better compare it to baseline models for this task.

For both the Gaussian and Laplacian models, we evaluate 100-dimensional word vectors and set the regularization parameter λ = 10^-4. Our previous work and preliminary experiments with this dataset suggested the learned word vectors are relatively insensitive to changes in these parameters.

Supplementary diagram A shows a 2-D visualization of the learned word similarities for the 2,000 most frequent words in our vocabulary. The visualization was created using the t-SNE algorithm, with code provided by van der Maaten and Hinton (2008). Word vectors are cosine normalized before passing them to the t-SNE algorithm.
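An equivalent pipeline can be reproduced with a current t-SNE implementation. A sketch (using scikit-learn’s TSNE in place of the original code, with random vectors standing in for the learned representations):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
word_vectors = rng.standard_normal((2000, 100))   # stand-ins for the learned vectors

# Cosine normalize each vector (unit length) before handing it to t-SNE, as described above.
normalized = word_vectors / np.linalg.norm(word_vectors, axis=1, keepdims=True)

coords = TSNE(n_components=2, random_state=0).fit_transform(normalized)
print(coords.shape)                               # (2000, 2) positions ready to plot
```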
The visualization clearly shows words grouped locally by semantic associations — for example, “doctor” and “medication” are nearby. Additionally, there is some evidence that the macroscopic structure of the words correlates with how they influence reaction distributions. A cluster of words containing playful, upbeat tokens like “:-)” and “haha” are all likely to appear in stories which elicit the rock or teehee reactions. Far removed from such happy words are clusters of words indicative of melancholic subjects, marked by words like “cancer” and “suicide.” We note that sad and troubling topics are highly prevalent in the data, and our visualization reflects this fact.

After learning the word representations, we represent documents using average word vectors. This approach uses the arithmetic average of the word vectors for all words which appear in the document. Because we learn word vectors for only the 5,000 most frequent words, a small fraction of the documents contain only words for which we do not have vector representations. These documents are represented as a vector of all zeros.
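This document representation is a one-liner in practice; a minimal sketch (hypothetical two-dimensional vectors for readability):

```python
import numpy as np

def document_vector(tokens, word_vectors):
    """Average the vectors of in-vocabulary words; all zeros if no token is known."""
    beta = next(iter(word_vectors.values())).shape[0]
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(beta)

word_vectors = {"crush": np.array([0.2, -0.1]), "boss": np.array([0.4, 0.3])}
print(document_vector("i have a crush on my boss".split(), word_vectors))
# -> [0.3  0.1], the mean of the two known words' vectors
```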
4.2 Alternative Methods

In addition to the vectors induced using our model, we evaluate the performance of several standard approaches to document categorization and information retrieval.

Unigram Bag of Words. Representing a document as a vector of word counts performs surprisingly well in many classification tasks. In our preliminary experiments, we found that term presence performs better than term frequency on EP data, as noted in previous work on sentiment (Pang et al., 2002). We also note that delta inverse document frequency weighting, which has been shown to sometimes perform well in sentiment (Martineau and Finin, 2009), does not extend easily to multi-dimensional notions of sentiment. We thus use term presence vectors with no normalization and evaluate with the full vocabulary of the dataset and the 5,000 word vocabulary used in building word vectors.

Latent Semantic Analysis (LSA). We apply truncated singular value decomposition to a term-document count matrix to obtain word vectors from LSA (Deerwester et al., 1990). We first apply tf.idf weighting to the term-document matrix, but do not use cosine normalization. We use the same 5,000 word vocabulary as is used when constructing word vectors for our model.
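The LSA baseline can be reproduced with standard tools. A sketch (scikit-learn’s tf-idf weighting and truncated SVD as stand-ins for the original implementation, on a toy corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["i really hate being shy",
        "i constantly narrate my own life",
        "i have a crush on my boss"]          # toy stand-in corpus

# tf.idf-weighted matrix, no cosine normalization (norm=None), documents x terms.
X = TfidfVectorizer(norm=None).fit_transform(docs)

# Truncated SVD of the term-document matrix: one low-dimensional vector per term.
svd = TruncatedSVD(n_components=2, random_state=0)
word_vectors = svd.fit_transform(X.T)         # terms x components
print(word_vectors.shape)
```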

                                     ≥ 5 reactions      ≥ 1 reaction
Features                             KL     Max Acc.    KL     Max Acc.
Uniform Reactions                    0.861  20.2        1.275  20.4
Mean Training Reactions              0.763  43.0        1.133  46.7
Bag of Words (All unigrams)          0.637  56.0        1.000  53.4
Bag of Words (Top 5000 unigrams)     0.640  54.9        0.992  54.3
LSA                                  0.667  51.8        1.032  52.2
Our Method, Laplacian Prior          0.621  55.7        0.991  54.7
Our Method, Gaussian Prior           0.620  55.2        0.991  54.6

Table 3: Test set performance.

4.3 High Entropy Reaction Distributions

Our first experiment considers only the examples with at least five reaction clicks, because they best exhibit the blended distributional notion of sentiment of interest in this work. For all of the feature sets described (mean word vectors and bag of words), we train a softmax classifier on the training set. The softmax classifier is a predictor of the same form as is described in equation (5), but with a quadratic regularization penalty on the weights. The strength of the regularization penalty is set by cross-validation on the training set. The classifier is trained to minimize the KL divergence of predicted and actual distributions on the training set. We then evaluate the models by measuring average KL divergence on the test set.
We also report performance of models in terms of accuracy in predicting the maximum probability reaction for a document. In this setting, the model picks a single category corresponding to its most probable predicted reaction. A prediction is counted as correct if that category is the most probable in the true reaction distribution, or if it is tied with other categories for the role of most probable. None of the models were explicitly optimized to perform this task, but instead to predict the full distribution of reactions. However, it is helpful to compare this performance metric to KL divergence, as measuring performance in terms of accuracy is more familiar. Table 3 shows the results; recall that lower average KL divergence indicates better performance.
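The tie-aware accuracy just described can be written compactly; a sketch (with `predicted` and `actual` as arrays holding one reaction distribution per row):

```python
import numpy as np

def max_reaction_accuracy(predicted, actual):
    """Count a prediction correct if its argmax category ties for most probable in the truth."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    picks = predicted.argmax(axis=1)                      # each model's single chosen category
    is_max = actual >= actual.max(axis=1, keepdims=True)  # True where a category ties for the max
    return is_max[np.arange(len(picks)), picks].mean()

actual = np.array([[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]])
predicted = np.array([[0.6, 0.2, 0.2], [0.1, 0.6, 0.3]])
print(max_reaction_accuracy(predicted, actual))  # 0.5: the first pick ties for the max, the second misses
```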
All bag of words and vector space models beat the simplistic baselines of predicting the average reaction distribution or a uniform distribution. The improvements in both KL divergence and accuracy are substantial relative to these simplistic baselines, suggesting that it is indeed feasible to predict reaction distributions from text. Both variants of our model perform better than bag of words and LSA in KL divergence, but bag of words performs best using accuracy as the metric. That the accuracy and KL metrics disagree on the models’ performance rankings suggests categorization accuracy is not a sufficient indicator of how well models capture a distributional notion of sentiment. Based on the poor performance of LSA-derived word vectors, we hypothesize that learning representations using sentiment distributions is critical when attempting to capture the blended sentiment information within documents.

Differences in KL divergence are somewhat difficult to interpret, so we use a matched t-test to evaluate their significance. The matched t-test between two models takes the KL divergence for each test example and evaluates the hypothesis that the KL divergence numbers come from the same distribution. KL divergences on the set of test examples are approximately gamma distributed, with a valid range of [0, ∞). We thus apply the matched t-test to the logarithm of the KL divergences, which is closer to the Gaussian distribution assumed by the t-test. We find that the difference in KL divergence between our models and the bag of words models is significant (p < 0.001). However, the Gaussian and Laplacian prior variants of our model do not differ significantly from each other. The prior over document coefficients perhaps has little effect relative to the other components of our model, causing both model variants to perform comparably.
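This comparison is available directly in scipy; a minimal sketch (synthetic gamma-distributed per-example KL values standing in for real test results):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
kl_ours = rng.gamma(shape=2.0, scale=0.30, size=1307)  # hypothetical per-example KL, our model
kl_bow = rng.gamma(shape=2.0, scale=0.33, size=1307)   # hypothetical per-example KL, bag of words

# Matched (paired) t-test on the logs, which better fit the t-test's Gaussian assumption.
t_stat, p_value = stats.ttest_rel(np.log(kl_ours), np.log(kl_bow))
print(t_stat, p_value)
```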
4.4 All Reaction Distributions

We repeated the experimental procedure using the full dataset, which includes all documents with at least one reaction. As noted in figure 2, these reaction distributions have low average entropy because a large number of documents have only a few reactions. Distribution predictors for all models were trained and evaluated on this dataset; table 3 shows the results.

Again all models outperform the naive baselines of guessing the average training distribution or a uniform distribution. A third baseline (not shown), which assigns 99% of its probability mass to the dominant understand category, performs substantially worse than all results shown. Although the differences in KL divergence between our models and the bag of words baselines are numerically small, the improvement of our models is significant as measured by the matched t-test (p < 0.001). The significance of such small differences is due to the large testing set size. Again the Gaussian and Laplacian variants of our model do not differ significantly from each other in performance.

We see that all models have a higher average KL divergence on this task as compared to evaluation on the set of documents with at least five reactions. As shown in table 2, reaction distributions with zero entropy dominate this version of the dataset. We hypothesize that the higher average KL divergences and small numerical differences in KL divergence are largely due to all predictors struggling to fit these zero entropy distributions, which were formed with only one reaction click.

5 Conclusion

Using the confessions at the EP, we showed that natural language texts often convey a wide range of sentiment information to varying degrees. While classification models can capture certain emotive dimensions, they miss this blended, continuous nature of sentiment expression. Building on the existing classifier model of Anonymous (2011), we developed a vector-space model that learns from distributions over emotive categories, in addition to capturing basic semantic information in an unsupervised fashion. The model is successful in absolute terms, suggesting that learning realistic sentiment distributions is tractable, and it also outperforms various baselines, including LSA. We believe the task of predicting sentiment distributions from text provides a rich challenge for the field of sentiment analysis, especially when compared to simpler classification tasks. Going forward, we plan to move beyond the lexical level to capture the ways in which sentiment is influenced by compositional semantic facts (e.g., interaction with negation and other non-veridical operators), which we expect to provide further insights into the complexities of sentiment expression.

Acknowledgments

This work is supported by the DARPA Deep Learning program under contract number FA8650-10-C-7020, an NSF Graduate Fellowship awarded to AM, ONR grant No. N00014-10-1-0109, and ARO grant No. W911NF-07-1-0216.

References

Cecilia Ovesdotter Alm, Dan Roth, and Richard Sproat. 2005. Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4-5):993–1022, May.

Rebecca F. Bruce and Janyce M. Wiebe. 1999. Recognizing subjectivity: A case study in manual tagging. Natural Language Engineering, 5(2).

Luís Cabral and Ali Hortaçsu. 2006. The dynamics of seller reputation: Theory and evidence from eBay. Working paper, downloaded version revised in March.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, September.

Paul Ekman. 1992. An argument for basic emotions. Cognition and Emotion, 6(3/4):169–200.

Andrew B. Goldberg and Jerry Zhu. 2006. Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization. In TextGraphs: HLT/NAACL Workshop on Graph-based Algorithms for Natural Language Processing.
Vasileios Hatzivassiloglou and Janyce Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the International Conference on Computational Linguistics (COLING).

Hugo Liu, Henry Lieberman, and Ted Selker. 2003. A model of textual affect sensing using real-world knowledge. In Proceedings of Intelligent User Interfaces (IUI), pages 125–132.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June. Association for Computational Linguistics.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, 1st edition.

J. Martineau and T. Finin. 2009. Delta TFIDF: An improved feature space for sentiment analysis. In Proceedings of the Third AAAI International Conference on Weblogs and Social Media.

Alena Neviarouskaya, Helmut Prendinger, and Mitsuru Ishizuka. 2010. Recognition of affect, judgment, and appreciation in text. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 806–814, Beijing, China, August. COLING 2010 Organizing Committee.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Association for Computational Linguistics (ACL), pages 271–278.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1):1–135.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, Philadelphia, July. Association for Computational Linguistics.

Christopher Potts. 2010. On the negativity of negation. In David Lutz and Nan Li, editors, Proceedings of Semantics and Linguistic Theory 20. CLC Publications, Ithaca, NY.

Ellen Riloff and Janyce Wiebe. 2003. Learning extraction patterns for subjective expressions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Ellen Riloff, Janyce Wiebe, and William Phillips. 2005. Exploiting subjectivity classification to improve information extraction. In Proceedings of AAAI, pages 1106–1111.

James A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology, 39(6):1161–1178.

Benjamin Snyder and Regina Barzilay. 2007. Multiple aspect ranking using the Good Grief algorithm. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT-NAACL), pages 300–307.

Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of EMNLP, pages 327–335.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the ACL.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Peter Turney. 2002. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), pages 417–424.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, November.

Janyce M. Wiebe, Rebecca F. Bruce, and Thomas P. O’Hara. 1999. Development and use of a gold standard data set for subjectivity classifications. In Proceedings of the Association for Computational Linguistics (ACL), pages 246–253.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation (formerly Computers and the Humanities), 39(2/3):164–210.

Theresa Wilson, Janyce Wiebe, and Rebecca Hwa. 2006. Just how mad are you? Finding strong and weak opinion clauses. Computational Intelligence, 22(2):73–99.
