Feature Dependent Method For Sentiment Analysis Text Mining 2014

cPGCON 2014, Third Post Graduate and Research Scholar Symposium, University of Pune
A Feature Dependent Method for Sentiment
Analysis to Understand User Context in Web

Ms. Neha S. Joshi
Department of Computer Engineering
Progressive Education Society's Modern College of
Engg., Shivajinagar,
Pune, India
neha05101990@gmail.com

Prof. Mrs. Suhasini A. Itkar
Department of Computer Engineering
Progressive Education Society's Modern College of
Engg., Shivajinagar,
Pune, India
suhasini_naik@yahoo.com

Abstract: These days we rely heavily on the opinions of friends
and domain experts when making decisions in day-to-day life, for
example which brand is best for a certain product, whether a
current movie is good, whether a product gives better performance,
or how a travel site is rated. Opinion mining, also known as
sentiment analysis, plays an important role in this process. It is
the study of emotions, i.e. sentiments and expressions that are
stated in natural language. Natural language processing techniques
are applied to extract emotions from unstructured data. In this
paper, feature-level analysis is considered, which is known as
fine-grained analysis and takes each entity in a review together
with its corresponding polarity. In the proposed work, an
Artificial Neural Network approach with the Jaccard similarity
measure is presented. The Jaccard similarity measure performs well
in measuring the similarity of words when they are compared letter
by letter. This approach is similar to the existing Support Vector
Machine (SVM) approach, but the SVM approach has certain
disadvantages, such as limited parameter selection, and its
performance degrades when a large number of features is selected.
Hence a new approach with a similarity measure is proposed.
Index Terms: Classification, Machine learning, Natural
Language Processing (NLP), Opinion mining, Parts of speech
tags, Sentiment analysis, Semi-Supervised learning, Sentiment
Classification, Support Vector Machine (SVM), Term weighting,
Polarity.
I. INTRODUCTION
Today people not only comment on existing information,
bookmark pages, and provide ratings, but also share their
ideas, news and knowledge with the community at large. The
main aim of information gathering is to analyze what other
people think. The World Wide Web contains a massive amount of
unstructured data. With the increasing popularity of
opinion-based websites and other resources, new challenges
have arisen in opinion mining. It is now becoming evident that
the views expressed on the web can influence readers in forming
their opinions on a topic. Similarly, the opinions expressed by
users are an important factor taken into consideration by
product vendors and policy makers. There are a number of
differences in meaning between emotions, sentiments and
opinions. The most notable one is that an opinion is a
transitional concept, which reflects our attitude towards
something. On the other hand, sentiments differ from opinions
in that they reflect our feeling or emotion, not always
directed towards something. Further still, our emotions may
reflect our attitudes. Sentiment analysis, also known as
opinion mining, plays an important role in determining the
direction of sentiments, also known as polarity. It is
currently a significant trend in natural language processing.
As opinions are expressed in natural language, it involves
machine learning, i.e. giving artificial intelligence to
computers. Opinion mining extracts emotions, sentiments and
opinions from the document corpus and analyzes them.
There are different machine learning approaches used to
analyze opinions, viz. supervised learning and unsupervised
learning. For large-scale sentiment analysis, unsupervised
learning is used. Supervised machine learning techniques train
on a sample data set and are later tested on a subset of it.
Among the supervised learning approaches, SVM gives maximum
accuracy when used with unigrams, but it has a few
disadvantages. In this paper, the design of the proposed
approach and its implementation details are presented.
II. RELATED WORK
There are three main levels of sentiment analysis, namely
document-level analysis, sentence-level analysis and
feature-level analysis. In document-level [1] and
sentence-level analysis one cannot identify a reviewer's likes
or dislikes about a specific feature of an object. It has been
found that document-level and sentence-level classification
are not enough to identify every detail about the sentiments
expressed in a document, as sentiments may be expressed with
respect to different features. In the feature-level method, an
algorithm with parts of speech tags is used to improve the
accuracy on the benchmark dataset. It is a fine-grained
analysis process which takes every feature of an object into
consideration [2].
Abd. Samad Hasan Basari, Burairah Hussin et al. [3] proposed
a new approach which considers both SVM and SVM with particle
swarm optimization (PSO). Experiments were carried out to
compare the performance and accuracy of both approaches. It
was found that SVM-PSO gives better accuracy and precision
than SVM on data without cleansing, but SVM gives better
recall than SVM-PSO.
Rudy Prabowo and Mike Thelwall [4] presented a hybrid
classification which combines several classification
approaches, viz. rule-based classification, a statistics-based
classifier and SVM, to give better performance. It applies the
classifiers in sequence, and any set of configurations is
possible, for example Statistics Based Classifier (SBC)-SVM or
Induction Rule Based Classifier-SBC-SVM. The SVM classifier is
generally placed last because every document it receives is
classified as either positive or negative, so it gives no
later classifier the chance to carry out a classification. The
disadvantage of this approach is that if the number of samples
is not sufficient, the SVM is too weak to test the next sample
subsets.
Michelle Annett and Grzegorz Kondrak [5] proposed a novel
approach based on SVM. They worked on feature vectors of
different sizes, representations and types, and compared the
new approach with existing approaches. They concluded that the
type of feature vector chosen has a great impact on the
accuracy of the classifier.
In [6, 7] the authors pose the problem of grouping feature
synonyms, which is a currently researched topic and an
analytically daunting one. The idea is to acquire useful and
possibly more explanatory synonyms for features, which makes
analysis more robust and easier given a training data set.
However, we are usually interested in summarizing all reviews
of a particular movie, meaning these features and feature
synonyms are often not given. For example, reviews on sites
such as [16] and [17] give users an overall score for a movie
that essentially summarizes how good reviewers on average
think the movie is. This does not summarize certain features
or attributes of the movie, but simply summarizes the movie as
a whole to tell a user whether or not it comes recommended.
Hogenboom et al. [8] proposed a method which considers the
negation scope and strength of a word while classifying
whether a word has a positive or negative effect on the
sentence. For example, consider the two sentences "I am happy
with your performance" and "I am not that happy with your
performance". The first sentence expresses a positive emotion.
If we consider only the negative keyword "not", the second
sentence would be treated as equivalent to "I am not happy
with your performance", which is not correct. Taking the scope
and strength of the negative keywords into account when
deciding their effect gives better results. The proposed
approach uses two algorithms. The first calculates a score for
each word. The second calculates the sentence score using the
word sense and word score with respect to each negative
keyword. If the calculated sentence score is less than zero,
the sentence is assigned to the negative class.
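The idea of negation scope and strength can be illustrated with a small sketch. This is not the algorithm of [8]; it is a simplified stand-in under stated assumptions: the word scores, the negation list, the list of softening words, the scope of two tokens and the strength values are all hypothetical and chosen only to reproduce the example sentences above.

```python
# Hypothetical word scores; a real system would use a sentiment lexicon.
WORD_SCORES = {"happy": 1.0, "good": 1.0, "bad": -1.0, "poor": -1.0}
NEGATIONS = {"not", "never", "no"}
# Hypothetical modifiers that weaken the negation they follow.
SOFTENERS = {"that", "very", "really"}


def sentence_score(sentence, scope=2, strength=1.0, softened_strength=0.5):
    """Sum word scores, inverting words within `scope` tokens of a negation.

    A negated word contributes -strength * score. If a softener follows
    the negation ("not that happy"), softened_strength is used instead,
    so the result is less negative than for a plain "not happy".
    """
    tokens = sentence.lower().replace(".", " ").split()
    total = 0.0
    remaining = 0       # upcoming tokens still inside the negation scope
    current = strength  # strength of the active negation
    for token in tokens:
        if token in NEGATIONS:
            remaining = scope
            current = strength
            continue
        if remaining > 0 and token in SOFTENERS:
            current = softened_strength
            continue    # softeners do not use up the scope
        score = WORD_SCORES.get(token, 0.0)
        if remaining > 0:
            score = -current * score
            remaining -= 1
        total += score
    return total


def classify(sentence):
    # Scores below zero go to the negative class; treating a score of
    # exactly zero as positive is an assumption of this sketch.
    return "negative" if sentence_score(sentence) < 0 else "positive"
```

With these assumed values, "I am not that happy with your performance" scores higher than "I am not happy with your performance", while both still fall in the negative class.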
Kechaou et al. [9] proposed an approach to evaluate users'
opinions on e-learning systems. Three feature selection
methods, MI (Mutual Information), IG (Information Gain) and
CHI (chi-square statistic), were examined together with an HMM
and SVM-based hybrid learning method. Their results showed
that IG performed the best. Applying data mining techniques to
e-learning reviews and studying e-learning blogs are some of
the challenges faced in further improving the accuracy of the
proposed system.
III. PROPOSED WORK
From previous work and the existing approaches it is clear
that the SVM approach has several disadvantages, although it
gives better performance than other approaches, viz. Naive
Bayes, Maximum Entropy etc. The proposed framework combines
the advantages of a similarity measure and an Artificial
Neural Network (ANN) to give better accuracy and efficiency
than the SVM approach. For the similarity measure there are
several methods, namely Jaccard, Dice and Cosine similarity,
any of which can be used. Figure 1 describes the flow of the
proposed method.
Figure 1. Block Diagram of Proposed System
The document corpus is a collection of data from Twitter,
blogs etc. It is the input to the data pre-processing step.
Since the data contain several syntactic features that may not
be useful for machine learning, they need to be cleaned of
items such as @ (link to a username), URLs or website links
(http, url, www), # (hashtag) and RT (retweet). A module that
offers a choice of different cleaning operations is designed.
Case normalization: Most English texts (and texts in many
other languages) are published in mixed case, that is, they
contain both uppercase and lowercase characters. Case
normalization turns the entire document or sentence into
lowercase.
Tokenization splits a stream of text into individual terms or
tokens. This procedure can take many forms, depending on the
language being analyzed. For English, a simple and effective
tokenization technique is to use white space and punctuation
as token delimiters.
Stemming is the procedure of reducing related tokens to a
single form. Typically the stemming procedure involves
identifying and removing prefixes, suffixes and inappropriate
pluralizations.
N-gram generation: Character n-grams are n adjacent characters
from a given input string. For example, the 3-grams of the
word TERM include _TE, TER and ERM, where _ pads the word
boundary. An n-gram of size one is known as a unigram, of size
two as a bigram, and so on.
Term frequency is found by simply counting the number of times
a given term occurs in a given document, and inverse document
frequency is found by dividing the total number of documents
by the number of documents in which the term appears and
taking the logarithm. When these values are multiplied
together we get a score that is highest for terms that appear
frequently in a few documents and low for terms that appear
frequently in every document, enabling us to find the terms
that are important in a document.
Finally a transformed data set is generated, which is used for
training.
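The cleaning, case normalization, tokenization and n-gram steps described above can be sketched as follows. This is a minimal illustration, not the module used in this work: the function names, the regular expressions and the underscore padding character are assumptions, and stemming is left out.

```python
import re


def clean_tweet(text):
    """Remove links, @username mentions, the RT marker and '#' signs."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"@\w+", " ", text)                    # @username mentions
    text = re.sub(r"\bRT\b", " ", text)                  # retweet marker
    return text.replace("#", " ")                        # keep the hashtag word


def tokenize(text):
    """Lowercase the text, then split on white space and punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_ngrams(word, n=3, pad="_"):
    """Character n-grams of `word`, padded so boundary grams are included."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


tokens = tokenize(clean_tweet("RT @bob: Loved the movie! http://t.co/xyz #mustwatch"))
# tokens == ['loved', 'the', 'movie', 'mustwatch']
grams = char_ngrams("TERM")
# grams == ['__T', '_TE', 'TER', 'ERM', 'RM_', 'M__']
```

Setting `n=1` or `n=2` in `char_ngrams` gives character unigrams and bigrams respectively.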
Figure 2. Training and Testing Flow
Algorithmic Approach
The process starts with finding important keywords in the
documents and removing irrelevant words. A TF-IDF approach is
used initially. The formal procedure for implementing TF-IDF
differs slightly across applications, but the overall approach
works as follows. Given a document collection $D$, a word $w$,
and an individual document $d \in D$, we calculate
$$w_d = f_{w,d} \cdot \log\left(\frac{|D|}{f_{w,D}}\right)$$
where $f_{w,d}$ equals the number of times $w$ appears in $d$,
$|D|$ is the size of the corpus, and $f_{w,D}$ equals the
number of documents in $D$ in which $w$ appears. The
efficiency is O(n).
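The weighting formula above can be written directly in code. The sketch below assumes each document is already a list of tokens and uses the natural logarithm; the function name `tf_idf` is illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document.

    weight = f_{w,d} * log(|D| / f_{w,D}), where f_{w,d} is the count of
    w in document d and f_{w,D} is the number of documents containing w.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weights


docs = [
    ["great", "movie", "great"],
    ["bad", "movie"],
    ["great", "movie", "plot"],
]
weights = tf_idf(docs)
```

In this toy corpus "movie" occurs in every document, so its weight is zero everywhere, while "bad" and "plot", which each occur in a single document, receive the highest per-occurrence weight.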
Once the important terms in the documents are obtained, a
similarity measure is applied. Here the Jaccard similarity
measure is used, which compares two sets of items and measures
how much they overlap. The Jaccard similarity is defined as
follows:
$$JS(A,B) = \frac{|A \cap B|}{|A \cup B|}$$
which indicates how similar A and B are: it is 0 when the sets
share no elements and 1 when they are identical.
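As a small illustration of the definition, the sketch below computes the Jaccard similarity of two words by treating each word as the set of its letters. Comparing letters is one possible choice of sets; word sets or n-gram sets work the same way. Returning 1.0 for two empty sets is an assumption of this sketch.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)


similarity = jaccard_similarity("good", "goods")
# letters {g, o, d} versus {g, o, d, s}: 3 shared out of 4 in total
```

For "good" and "goods" the result is 0.75, since three of the four distinct letters are shared.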
Finally we get a feature set grouped by similarity, i.e.
positive and negative. An Artificial Neural Network (ANN) is
then used for training. One specific benefit that these models
have over SVMs is that their size is fixed: they are
parametric models, while SVMs are non-parametric. That is, an
ANN has a number of hidden layers with sizes $h_1$ through
$h_n$, chosen depending on the number of features, plus bias
parameters, and these make up the trained model.
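The fixed model size can be made concrete with a very small network. The sketch below is an assumed minimal design (one hidden layer, sigmoid units, plain stochastic gradient descent), not the network used in this work; the class name `TinyMLP` and all hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """n_in inputs -> n_hidden sigmoid units -> one sigmoid output."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def parameter_count(self):
        """Weights plus biases; depends only on the layer sizes."""
        n_hidden, n_in = len(self.w1), len(self.w1[0])
        return n_hidden * n_in + n_hidden + n_hidden + 1

    def forward(self, x):
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def predict(self, x):
        return self.forward(x)[1]

    def train(self, samples, labels, epochs=200, lr=0.5):
        """Per-sample gradient descent on the cross-entropy loss."""
        for _ in range(epochs):
            for x, y in zip(samples, labels):
                hidden, out = self.forward(x)
                d_out = out - y  # gradient at the output pre-activation
                for j, h in enumerate(hidden):
                    # use the old w2[j] before it is updated
                    d_hidden = d_out * self.w2[j] * h * (1.0 - h)
                    self.w2[j] -= lr * d_out * h
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hidden * xi
                    self.b1[j] -= lr * d_hidden
                self.b2 -= lr * d_out


model = TinyMLP(n_in=4, n_hidden=3)
model.train([[1, 0, 1, 0], [0, 1, 0, 1]], [1, 0])
```

The value returned by `parameter_count()` is the same before and after training and does not change with the number of training samples, which is the property contrasted with SVMs above.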
Implementation Details
The proposed work is implemented by designing the following
modules.
1) Collecting dataset.
2) Pre-processing and storing domain specific keywords.
3) Calculating TF-IDF.
4) Similarity measure.
5) Feature Extraction.
6) Training.
7) Classification and Analysis.
Datasets
Experiments are carried out on movie review datasets taken
from amazon.com. Each dataset consists of 100 reviews
classified by overall orientation as either positive or
negative (50 positive and 50 negative reviews). The ground
truth was obtained from the customers' 5-star ratings: reviews
with more than 3 stars were defined as positive and reviews
with fewer than 3 stars were labeled as negative [19].
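The labeling rule can be sketched as a small helper. The text does not say how reviews with exactly 3 stars are handled, so returning None for them is an assumption of this sketch, as is the function name.

```python
def label_from_stars(stars):
    """Map a 5-star customer rating to a ground-truth polarity label."""
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None  # assumption: 3-star reviews are left unlabeled


labels = [label_from_stars(s) for s in (5, 4, 3, 2, 1)]
```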
Performance Measurement
The classification performance can be evaluated in terms of
three measures, accuracy, recall and precision, as defined
below. A confusion matrix is used for this.

                  Machine says yes    Machine says no
Human says yes    True positive       False negative
Human says no     False positive      True negative

Table 1. Confusion Matrix

With TP, TN, FP and FN denoting the numbers of true positive,
true negative, false positive and false negative samples:
$$\text{Accuracy} = \frac{TP + TN}{\text{Total number of samples}}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
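The three measures can be computed from the confusion matrix counts as in the following sketch. The function names are illustrative, and returning 0.0 when a denominator is zero is an assumption made here to avoid division errors.

```python
def confusion_counts(human, machine):
    """Count TP, FN, FP, TN from parallel lists of booleans (True = yes)."""
    tp = fn = fp = tn = 0
    for h, m in zip(human, machine):
        if h and m:
            tp += 1
        elif h and not m:
            fn += 1
        elif not h and m:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


human = [True, True, True, False, False, False]
machine = [True, True, False, True, False, False]
tp, fn, fp, tn = confusion_counts(human, machine)
```

For these six toy samples the counts are 2 true positives, 1 false negative, 1 false positive and 2 true negatives, giving an accuracy of 4/6 and a recall and precision of 2/3 each.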
Expected Result
The proposed methodology should lead to better accuracy and
should have lower computational complexity, high complexity
being a major disadvantage of SVM. It should also cope with a
large number of sample sets.
IV. CONCLUSION
Support Vector Machine (SVM) has been widely and successfully
used in sentiment analysis, whereas artificial neural networks
(ANNs) have attracted little attention as an approach for
sentiment learning. The literature reports as a disadvantage
of the SVM approach that it does not cope well with large
numbers of features and samples. To the best of my knowledge,
ANN gives accuracy at least as good as SVM, whereas for an
SVM, in the worst case, the number of support vectors is
exactly the number of training samples (though that mainly
occurs with small training sets or in degenerate cases), and
in general its model size scales linearly with the number of
training samples. In natural language processing, SVM
classifiers with tens of thousands of support vectors, each
having hundreds of thousands of features, are not unheard of.
Thus, ANN is chosen.
V. ACKNOWLEDGMENT
Foremost, I would like to express my sincere gratitude to my
guide Prof. Mrs. S. A. Itkar for her continuous support,
patience, motivation, enthusiasm, and immense knowledge. Her
guidance helped me throughout the research and the writing of
this paper. Besides my guide, I would like to thank our M.E.
Coordinator Prof. Ms. D. V. Gore for her encouragement,
insightful comments, and valuable guidance.
VI. REFERENCES
[1] Sowmya Kamath S, Anusha Bagalkotkar, Ashesh Khandelwal,
Shivam Pandey, Kumari Poornima, "Sentiment Analysis Based
Approaches for Understanding User Context in Web Content",
978-0-7695-4958-3/13, 2013 IEEE.
[2] Bing Liu, "Sentiment Analysis and Opinion Mining",
Morgan & Claypool Publishers, May 2012.
[3] Abd. Samad Hasan Basari, Burairah Hussin, I. Gede
Pramudya Ananta, Junta Zeniarja, "Opinion Mining of Movie
Review using Hybrid Method of Support Vector Machine and
Particle Swarm Optimization", Procedia Engineering 53 (2013)
453-462.
[4] Rudy Prabowo, Mike Thelwall, "Sentiment Analysis: A
Combined Approach".
[5] Michelle Annett and Grzegorz Kondrak, "A Comparison of
Sentiment Analysis Techniques: Polarizing Movie Blogs".
[6] B. Liu, "Web Data Mining: Exploring hyperlinks, contents,
and usage data", Opinion Mining. Springer, 2007.
[7] B. Pang and L. Lee, "Opinion Mining and Sentiment
Analysis", Foundations and Trends in Information Retrieval,
Vol. 2, Nos. 1-2, pp. 1-135, 2008.
[8] Hogenboom, A.; van Iterson, P.; Heerschop, B.; Frasincar,
F.; Kaymak, U., "Determining negation scope and strength in
sentiment analysis", Systems, Man, and Cybernetics (SMC),
2011 IEEE International Conference on, pp. 2589-2594, 9-12
[9] Kechaou, Z.; Ben Ammar, M.; Alimi, A.M., "Improving
e-learning with sentiment analysis of users' opinions", Global
Engineering Education Conference (EDUCON), 2011 IEEE,
pp. 1032-1038, 4-6 April 2011.
[10] Wenying Zheng, Qiang Ye, "Sentiment Classification of
Chinese Traveler Reviews by Support Vector Machine Algorithm",
Third International Symposium on Intelligent Information
Technology Application, 2009.
[11] Yuanbin Wu, Qi Zhang, Xuanjing Huang, Lide Wu, "Phrase
Dependency Parsing for Opinion Mining", Proceedings of the
2009 Conference on Empirical Methods in Natural Language
Processing, pages 1533-1541, Singapore, 6-7 August 2009.
(c) 2009 ACL and AFNLP.
[12] Rudy Prabowo, Mike Thelwall, "Sentiment Analysis: A
Combined Approach", White paper.
[13] Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, "Thumbs
up? Sentiment Classification using Machine Learning
Techniques", In Proceedings of EMNLP 2002, pp. 50-57.
[14] G. Salton and C. Buckley, "Term-weighting approaches in
automatic text retrieval", Information Processing &
Management, vol. 24, issue 5: 513-523, 1988.
[15] Zhang, J.; Kawai, Y.; Nakajima, S.; Matsumoto, Y.;
Tanaka, K., "Sentiment Bias Detection in Support of News
Credibility Judgment", System Sciences (HICSS), 2011 44th
Hawaii International Conference on, pp. 1-10, 4-7 Jan. 2011.
[16] Kechaou, Z.; Ben Ammar, M.; Alimi, A.M., "Improving
e-learning with sentiment analysis of users' opinions", Global
Engineering Education Conference (EDUCON), 2011 IEEE,
pp. 1032-1038, 4-6 April 2011.
[17] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury,
"Twitter power: Tweets as electronic word of mouth", Journal
of the American Society for Information Science and
Technology, vol. 60, no. 11, pp. 2169-2188, 2009.
[18] "ROTTEN TOMATOES: Movies - New Movie Reviews and
Previews", http://www.rottentomatoes.com.
[19] "Metacritic - Movie Reviews, TV Reviews, Game reviews,
and Music Reviews", http://www.metacritic.com.
[20] Alexandra Balahur, Andrés Montoyo, "A Feature Dependent
Method for Opinion Mining and Classification",
978-1-4244-2780-2/08, 2008 IEEE.
[21] Sowmya Kamath S, Anusha Bagalkotkar, Ashesh Khandelwal,
"Sentiment Analysis Based Approaches for Understanding User
Context in Web Content", International Conference on
Communication Systems and Network Technologies, 2013.
[22] Liu Gongshen, Lai Huoyao, Luo Jun, Lin Jiuchuan,
"Predicting the Semantic Orientation of Movie Reviews",
Seventh International Conference on Fuzzy Systems and
Knowledge Discovery (FSKD 2010).
[23] Gang Li, Fei Liu, "A Clustering-based Approach on
Sentiment Analysis", 978-1-4244-6793-8/10, 2010 IEEE.
[24] Mikalai Tsytsarau, Themis Palpanas, "Survey on mining
subjective data on the web", Data Min Knowl Disc (2012)
24:478-514, DOI 10.1007/s10618-011-0238-6, Springer.
