
Sentiment Analysis using Linear Regression

Shiry Ginosar

Avital Steinitz

February 23, 2012


In this assignment we learn a linear model that predicts the rating of textual book reviews using linear regression. Despite its simplicity, the linear model performs fairly
well. However, a careful choice of the model's ingredients, such as feature selection
and the regularization term, could further improve its accuracy. In our work we study Unigram
feature selection using (Boolean) Information Gain and we compare the performance of the resulting
Boolean and Multinomial models. We observe that including more features
matters more than the difference between the Boolean and the Multinomial models.
Another major consideration in such large data mining tasks is efficient representation and
manipulation of the data. Implementing one's own conjugate descent algorithm (a fascinating
research area in its own right) is appealing, as it could yield a very fast and efficient solver. However,
in favor of focusing on evaluating the model, we chose to use the sparse data structures in Matlab,
and we report some statistics we collected about their efficiency.


For this assignment we used the 975,194 book reviews collected by Mark Dredze
and others at Johns Hopkins [1]. Each review consists of multiple XML tags: review, asin,
product name, helpful, rating, title, date, reviewer, reviewer location and review text. We chose
a bag-of-words representation of the reviews, based only on the text in the title, reviewer and
review text tags (assuming that the other tags contain data that is irrelevant for rating). We
disregarded words occurring fewer than four times in the corpus, assuming that their informativeness
would be negligible anyway.
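
The bag-of-words construction and the rare-word cutoff can be sketched as follows. This is a minimal Python illustration, not the Matlab code used for the experiments: the whitespace tokenizer, the lowercasing and the function names `build_bag_of_words` and `to_boolean` are assumptions made for the example, and the toy corpus uses a cutoff of 3 instead of 4 so that the filtering is visible.

```python
from collections import Counter


def build_bag_of_words(docs, min_count=4):
    """Return (vocabulary, count_rows) for a list of text documents.

    Words occurring fewer than `min_count` times in the whole corpus
    are dropped. Tokenization is a plain whitespace split (an assumption).
    """
    tokenized = [doc.lower().split() for doc in docs]
    corpus_counts = Counter(word for tokens in tokenized for word in tokens)
    vocabulary = sorted(w for w, c in corpus_counts.items() if c >= min_count)
    index = {w: j for j, w in enumerate(vocabulary)}

    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word in tokens:
            j = index.get(word)
            if j is not None:  # skip words removed by the cutoff
                row[j] += 1
        rows.append(row)
    return vocabulary, rows


def to_boolean(rows):
    """Turn occurrence counts (Multinomial) into 0/1 indicators (Boolean)."""
    return [[1 if n > 0 else 0 for n in row] for row in rows]


# Toy corpus: "great" occurs 4 times, "book" 3 times, the rest once.
docs = ["great great book", "boring book", "great read", "great book"]
vocab, X_counts = build_bag_of_words(docs, min_count=3)
X_bool = to_boolean(X_counts)
```

On the real corpus the resulting matrix is extremely sparse, which is why a sparse representation (as in Matlab) is preferable to the dense lists used here.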
The target values we aim to predict are of course the ratings. The rating of a review may be 1, 2,
4 or 5 stars (3-star reviews were removed from the data set). This particular corpus (and perhaps
Amazon in general) is heavily skewed towards positive reviews, with 65% of the reviews rated 5
stars, another 22% of the reviews rated 4 stars, and only 5% and 7% of the reviews rated 2 and 1
stars (respectively). We later study how well our models handle the three natural
cuts in the data, separating {1}-star reviews from {2, 4, 5}-star reviews, {1, 2}-star reviews from
{4, 5}-star reviews, and {1, 2, 4}-star reviews from {5}-star reviews, and we observe that the last
classification task seems to be consistently the most difficult.

Feature selection

Prior to performing any feature selection we constructed a matrix X and a vector y as follows: the
ith row of the matrix X is the bag-of-words representation of the ith review (that is,
$X_{i,j} = n$ if the jth word appears n times in the ith review), and the ith value in y is its rating.
To reduce the dimensionality of our data and remove noisy features (such as stop-words) we
performed feature selection using Information Gain. The Information Gain of the jth feature is
the value:
$$\mathrm{Ent}(y) - \mathrm{Ent}(y \mid x_j)$$
where y is the label of a single review x and $x_j$ is the (Boolean) occurrence of the jth feature in a
review, and where we compute the entropy of a discrete random variable V as the value
$$\mathrm{Ent}(V) = -\sum_{v \in \mathrm{vals}(V)} P(v) \log P(v).$$
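
The Information Gain of a single feature can be computed along these lines. This Python sketch estimates all probabilities from empirical frequencies and uses base-2 logarithms; the base is an assumption (the text does not specify it) and only rescales the scores without changing the ranking. The function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(V) = -sum_v P(v) log2 P(v), with P estimated from frequencies."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_column, labels):
    """Ent(y) - Ent(y | x_j), where x_j is the Boolean occurrence of feature j.

    `feature_column` may hold counts; any count above zero is treated as
    "the word occurs in the review".
    """
    present = [y for n, y in zip(feature_column, labels) if n > 0]
    absent = [y for n, y in zip(feature_column, labels) if n == 0]
    total = len(labels)
    conditional = sum(len(part) / total * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional


ratings = [1, 1, 5, 5]
informative = [1, 2, 0, 0]    # occurs only in the 1-star reviews
uninformative = [1, 0, 1, 0]  # occurs equally often in both classes
ig_informative = information_gain(informative, ratings)
ig_uninformative = information_gain(uninformative, ratings)
```

In this toy case the first word removes all uncertainty about the rating (gain of one bit) while the second removes none.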

According to the Information Gain we have computed, the 20 highest ranking features are: but,
waste, not, good, disappointing, boring, great, however, excellent, was, don't, money, bad,
disappointed, interesting, worst, poorly, would, some and highly. This list nicely illustrates the
reasonable performance of this feature selection procedure. It is interesting, however, to note a couple of
surprisingly informative words (was, some).
We used the Information Gain feature ranking to test the accuracy of different model sizes in
the following way: given a model size N, we slice the matrix X and keep only the columns that
correspond to the top N ranking features. Then, we run linear regression on that sliced matrix
and test the accuracy of the resulting classifier. The results of this experiment are presented in the
following sections.
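
The slicing step described above can be sketched in Python as follows. The scores are made-up numbers used only to show the mechanics, and the helper names are assumptions for the example.

```python
def top_n_columns(scores, n):
    """Indices of the n highest-scoring features, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:n]


def slice_columns(rows, columns):
    """Keep only the given columns of a row-major matrix."""
    return [[row[j] for j in columns] for row in rows]


# Hypothetical Information Gain scores for a three-word vocabulary.
scores = [0.02, 0.40, 0.10]
X = [[1, 0, 2],
     [0, 3, 1]]

keep = top_n_columns(scores, 2)
X_small = slice_columns(X, keep)
```

Repeating this for increasing N gives one design matrix per model size, each of which is then passed to the regression.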

Linear Regression and Model Size

We study ridge regression with cost parameter 0.001 (which was chosen arbitrarily after observing
that various choices of cost values did not seem to have an impact on the model's accuracy). We
computed the parameters using Matlab's left-division operator \:
    lambda = 0.001;
    A = X' * X + lambda * speye(size(X, 2));
    beta = A \ (X' * y);
According to the documentation, the use of \ avoids the explicit computation of the matrix inverse.
We study the efficiency of the \ operation by recording the time it took to compute the parameters for various
model sizes. Figure 1 shows the computation time as a function of the model size (the squared
number of features) in the Binomial case and Multinomial case. The figure nicely illustrates the
linear dependence between the time complexity of the regression computation and the model size.
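
For readers without Matlab, the same computation can be sketched in plain Python: form the regularized normal equations and solve the linear system directly instead of inverting the matrix. This is a dense toy version under stated assumptions; it is not the sparse solver Matlab selects internally, there is no intercept term (matching the formula above), and the names `solve_linear_system` and `ridge_fit` are illustrative.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest pivot into place for stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam=0.001):
    """Solve (X^T X + lam * I) w = X^T y without forming an inverse."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve_linear_system(A, b)


# Toy data generated by the weights (1, 5); the fit should land close to them.
X = [[1, 0], [0, 1], [1, 1]]
y = [1, 5, 6]
weights = ridge_fit(X, y, lam=0.001)
```

With such a small regularization term the recovered weights differ from (1, 5) only in the third decimal place, which is consistent with the observation that the cost value had little effect.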
Figures 2, 3 and 4 illustrate the accuracy of the learned linear regression classifiers,
comparing the Binomial and the Multinomial models as well as the quality of the 3 cuts induced
by the classifiers: the cut separating {1} stars from {2, 4, 5} stars, {1, 2} stars from {4, 5} stars
and {1, 2, 4} stars from {5} stars. We maintain a uniform color code in all three figures, where the
curves for the {1} vs. {2, 4, 5} cut are in blue, the curves for the {1, 2} vs. {4, 5} cut are in green
and the curves for the {1, 2, 4} vs. {5} cut are in red. Figure 2 evaluates the quality of the linear regression
classifiers fitted to a 70,000-feature model. The figure summarizes the results of a 10-fold cross
validation in 30 ROC curves: one ROC curve per cut per fold. Figure 3 shows the average area
under the ROC curve (AUC) for various model sizes. Figure 4 compares the corresponding
average 1% lift values.

Figure 1: Ridge regression computation time as a function of the model size in the Binomial case
and Multinomial case. The time values are averages over 10-fold cross validation.

Figure 2: The ROC curves corresponding to the 3 cuts induced by the 70,000-feature linear
classifier: one curve per fold per cut for the Binomial (left) and Multinomial (right) models.
In all figures the Boolean model seems to perform slightly (and rather insignificantly) better
than the Multinomial model, a common phenomenon in sentiment analysis (see for example [2]).
Also, Figure 4 seems to imply that larger models do not necessarily improve the 1% lift value, which
we find interesting. Much more significant is the improvement in the AUC values as the size of the
model grows, which suggests that in sentiment analysis even low-ranked words carry useful signal.
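
The two evaluation measures can be sketched in Python as follows. The AUC is computed through its pairwise interpretation (the probability that a randomly chosen positive review receives a higher predicted score than a randomly chosen negative one). The lift definition used here, the positive rate in the top-scoring fraction divided by the overall positive rate, is the common one and is an assumption, since the text does not spell out its definition; the function names are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half).

    Quadratic in the number of examples, so suitable only for small inputs.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(scores, labels, fraction=0.01):
    """Positive rate among the top `fraction` of scores over the base rate."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_rate = sum(l for _, l in ranked[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Toy predictions for one cut: regression outputs and the binary cut labels.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
toy_auc = auc(scores, labels)
toy_lift = lift_at(scores, labels, fraction=0.25)
```

In a cross-validation setting these functions would be applied to the held-out fold for each cut and then averaged over the folds.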

Figure 3: The average AUC as a function of the model size in the Binomial case and Multinomial case.

Figure 4: The 1% lift value as a function of the model size in the Binomial case and Multinomial case.


A significant observation that can be derived from our results is the following: the ROC curves tell
us that the {1, 2, 4} vs. {5} estimation (red curve) is the hardest task for our regression. Intuitively
this makes sense, since humans would also find it hardest to distinguish between 4- and 5-star reviews.
However, the reason may just as well be that our regression is only capable of distinguishing positive
reviews from negative reviews (this is also the explanation for the good performance represented
by the blue curve). This alternative explanation is also supported by the flatness of the lift curve.
It could be interesting to further investigate applications of linear regression to this specific kind
of rating estimation in order to better understand how well linear regression models this domain.

[1] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, boom-boxes and
blenders: Domain adaptation for sentiment classification. In Association for Computational
Linguistics, Prague, Czech Republic, 2007.
[2] Vangelis Metsis, Ion Androutsopoulos, and Georgios Paliouras. Spam filtering with naive Bayes:
which naive Bayes? In Third Conference on Email and Anti-Spam (CEAS), 2006.