
Shiry Ginosar

Avital Steinitz

Introduction

In this assignment we learn a linear model for determining the rating of textual book reviews from amazon.com using linear regression. Despite its simplicity, the linear model still performs fairly well. However, a clever choice of the various ingredients of the model, such as feature selection and the regularization term, can further improve its accuracy. In our work we study unigram feature selection using (Boolean) Information Gain, and we compare the performance of the resulting Boolean and Multinomial models. We observe that the benefit of including more features is more significant than the difference between the Boolean and the Multinomial models.

Another major consideration in such large data mining tasks is efficient representation and manipulation of the data. Implementing one's own conjugate gradient algorithm (a fascinating research area in its own right) is appealing, as it could yield a very fast and efficient solver. However, in order to focus on evaluating the model, we chose to use the sparse data structures in Matlab, and we report some statistics we collected about their efficiency.

Data

For this assignment we used the 975,194 amazon.com book reviews collected by Mark Dredze and others at Johns Hopkins [1]. Each review consists of multiple XML tags: review, asin, product name, helpful, rating, title, date, reviewer, reviewer location and review text. We chose a bag-of-words representation of the reviews, based only on the text in the title, reviewer and review text tags (assuming that the other tags contain data that is irrelevant for rating). We disregarded words occurring fewer than four times in the corpus, assuming that their informativeness would be negligible anyway.

The target values we aim to predict are of course the ratings. The rating of a review may be 1, 2, 4 or 5 stars (3-star reviews were removed from the data set). This particular corpus (and perhaps amazon in general) is heavily skewed towards positive reviews, with 65% of the reviews rated 5 stars, another 22% rated 4 stars, and only 5% and 7% rated 2 and 1 stars (respectively). We later study how well our models distinguish between the three natural cuts in the data, separating {1}-star reviews from {2, 4, 5}-star reviews, {1, 2}-star reviews from {4, 5}-star reviews, and {1, 2, 4}-star reviews from {5}-star reviews, and we observe that the last classification task is consistently the most difficult.
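As a small illustration (our own Matlab sketch, not the authors' code), the three binary cut labels can be derived from the ratings vector as follows:

% y is the vector of star ratings in {1, 2, 4, 5}; 3-star reviews were removed.
cut1 = (y > 1);    % separates {1} from {2, 4, 5}
cut2 = (y > 3);    % separates {1, 2} from {4, 5}
cut3 = (y == 5);   % separates {1, 2, 4} from {5}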

Feature selection

Prior to performing any feature selection we constructed a matrix X and a vector y as follows: the ith row of the matrix X consists of a bag-of-words representation of the ith review (that is, X_{i,j} = n if the jth word appears n times in the ith review), and the ith value of y is the rating of that review.
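A minimal sketch of how such a sparse count matrix can be assembled in Matlab; the triplet arrays and variable names here are our own assumptions, not the authors' code:

% Assumed inputs: rows(k), cols(k), counts(k) give the review index,
% vocabulary index and occurrence count of the kth (review, word) pair,
% collected while tokenizing the corpus.
nReviews = max(rows);
nWords = max(cols);
X = sparse(rows, cols, counts, nReviews, nWords);  % Multinomial (count) model
Xbool = spones(X);                                 % Boolean occurrence model

% Words occurring fewer than four times in the corpus are dropped:
keep = full(sum(X, 1)) >= 4;
X = X(:, keep);
Xbool = Xbool(:, keep);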

To reduce the dimensionality of our data and remove noisy features (such as stop-words) we performed feature selection using Information Gain. The Information Gain of the jth feature is the value

\mathrm{Ent}(y) - \mathrm{Ent}(y \mid x_j) ,

where y is the label of a single review x and x_j is the (Boolean) occurrence of the jth feature in a review, and where we compute the entropy of a discrete random variable V as the value

\mathrm{Ent}(V) = -\sum_{v \in \mathrm{vals}(V)} P(v) \log(P(v)) .
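As a rough sketch of this computation (our own code, reusing the assumed Xbool and y from the sketches above), the gain of every feature can be computed at once from the joint label/occurrence probabilities:

% Information gain Ent(y) - Ent(y | x_j) for every Boolean feature.
% Xbool: sparse n-by-d 0/1 matrix; y: n-by-1 ratings. Base-2 logs; the
% ranking is invariant to the base of the logarithm.
n = size(Xbool, 1);
plogp = @(p) p .* log2(max(p, realmin));   % convention: 0 * log(0) = 0
p1 = full(sum(Xbool, 1)) / n;              % P(x_j = 1) for each feature j
Hy = 0;                                    % accumulates Ent(y)
Hcond = zeros(1, size(Xbool, 2));          % accumulates Ent(y | x_j)
for k = unique(y)'                         % loop over the star values
    mask = (y == k);
    Hy = Hy - plogp(mean(mask));
    j1 = full(sum(Xbool(mask, :), 1)) / n; % joint P(y = k, x_j = 1)
    j0 = mean(mask) - j1;                  % joint P(y = k, x_j = 0)
    Hcond = Hcond - plogp(j1) + j1 .* log2(max(p1, realmin)) ...
                  - plogp(j0) + j0 .* log2(max(1 - p1, realmin));
end
ig = Hy - Hcond;                           % one gain value per feature
[~, order] = sort(ig, 'descend');          % ranking of the features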

According to the Information Gain we computed, the 20 highest-ranking features are: but, waste, not, good, disappointing, boring, great, however, excellent, was, dont, money, bad, disappointed, interesting, worst, poorly, would, some and highly. This list nicely illustrates the reasonable performance of this feature selection procedure; it is, however, interesting to note a couple of surprisingly informative words (was, some).

We used the Information Gain feature ranking to test the accuracy of different model sizes in the following way: given a model size N, we slice the matrix X, keeping only the columns that correspond to the top N ranking features. Then we run linear regression on that sliced matrix and test the accuracy of the resulting classifier. The results of this experiment are presented in the following sections.
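In Matlab this slicing amounts to column indexing into the sparse matrix (again our own sketch, reusing the ranking vector order from the Information Gain sketch above):

N = 70000;                % one of the model sizes we test
Xn = X(:, order(1:N));    % keep only the columns of the top-N ranked features
% (the same slice of Xbool gives the Boolean variant of the model)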

We study ridge regression with cost parameter λ = 0.001 (chosen arbitrarily after observing that various choices of the cost value did not seem to have an impact on the model's accuracy). We computed the parameters β using the left-division operator \:

λ = 0.001;
A = XᵀX + λI;
β = A\(Xᵀy);

According to the documentation, the use of \ avoids the explicit computation of the matrix inverse.

We study the efficiency of the \ operation by recording the time it took to compute β for various model sizes. Figure 1 shows the computation time as a function of the model size (the squared number of features) in the Boolean and Multinomial cases. The figure nicely illustrates the linear dependence between the time complexity of the regression computation and the model size.
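A sketch of the timing loop behind Figure 1 (our own reconstruction; the grid of model sizes and the use of a sparse identity are assumptions):

% Time the ridge solve for a range of model sizes.
sizes = [1000 5000 10000 30000 50000 70000];   % assumed grid of model sizes
times = zeros(size(sizes));
for s = 1:numel(sizes)
    Xn = X(:, order(1:sizes(s)));              % top-ranked features only
    tic;
    A = Xn' * Xn + 0.001 * speye(sizes(s));    % sparse identity keeps A sparse
    beta = A \ (Xn' * y);
    times(s) = toc;
end
plot(sizes, times);                            % computation time vs. model size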

Figure 1: Ridge regression computation time as a function of the model size in the Boolean and Multinomial cases. The time values are averages over 10-fold cross validation.

Figures 2, 3 and 4 illustrate the accuracy of the learned linear regression classifiers, comparing the Boolean and the Multinomial models as well as the quality of the 3 cuts induced by the classifiers: the cut separating {1} stars from {2, 4, 5} stars, {1, 2} stars from {4, 5} stars, and {1, 2, 4} stars from {5} stars. We maintain a uniform color code in all three figures: the {1}-{2, 4, 5} curves are in blue, the {1, 2}-{4, 5} curves are in green, and the {1, 2, 4}-{5} curves are in red.

Figure 2: The ROC curves corresponding to the 3 cuts induced by the 70,000-feature linear classifier: one curve per fold per cut for the Boolean (left) and Multinomial (right) models.

Figure 2 illustrates the evaluation of the quality of the linear regression classifiers fitted to a 70,000-feature model. The figure summarizes the results of a 10-fold cross validation in 30 ROC curves: one ROC curve per cut per fold. Figure 3 shows the average area under the ROC curve corresponding to various model sizes. Figure 4 compares the corresponding average 1% lift values.
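For one cut and one fold, the AUC and 1% lift can be obtained along these lines (our own sketch, not the authors' evaluation code; perfcurve requires the Statistics Toolbox):

% scores = Xtest * beta;  labels = binary cut labels for the test fold.
[~, ~, ~, auc] = perfcurve(labels, scores, true);  % area under the ROC curve
[~, idx] = sort(scores, 'descend');
k = ceil(0.01 * numel(labels));                    % top 1% of the ranking
lift = mean(labels(idx(1:k))) / mean(labels);      % 1% lift value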

In all figures the Boolean model seems to perform slightly (and rather insignificantly) better than the Multinomial model, a common phenomenon in sentiment analysis (see for example [2]). Also, Figure 4 seems to imply that larger models do not necessarily improve the 1% lift value, which we find interesting. Much more significant is the improvement in the AUC values as the size of the model grows, which implies that in sentiment analysis every word is a significant feature.

Figure 3: The average AUC as a function of the model size in the Boolean and Multinomial cases.

Figure 4: The 1% lift value as a function of the model size in the Boolean and Multinomial cases.

Discussion

A significant observation that can be derived from our results is the following: the ROC curves tell us that the {1, 2, 4}-{5} estimation (red curves) is the hardest task for our regression. Intuitively this makes sense, as humans too would find it hardest to distinguish between 4- and 5-star reviews. However, the reason may just as well be that our regression is only capable of distinguishing positive reviews from negative reviews (this would also explain the good performance represented by the blue curves). This alternative explanation is also supported by the flatness of the lift curve. It could be interesting to further investigate applications of linear regression to this specific kind of rating estimation in order to better understand how well linear regression models this domain.

References

[1] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL), Prague, Czech Republic, 2007.

[2] Vangelis Metsis, Ion Androutsopoulos, and Georgios Paliouras. Spam filtering with naive Bayes - which naive Bayes? In Third Conference on Email and Anti-Spam (CEAS), 2006.
