
International Journal of Computational Intelligence and Information Security, October 2014 Vol. 5, No. 7
ISSN: 1837-7823

Machine Learning Algorithms and their Significance in Sentiment Analysis for Context Based Mining
N. KARTHIKEYAN1 and R. DHANAPAL2

1* Head (B.C.A Dept),
Department of Computer Applications,
Srimad Andavan Arts & Science College,
Tiruchitrapalli, Tamil Nadu, India
E-mail: karthi_badri@yahoomail.com

2* Principal,
K.C.S. Kasi Nadar College of Arts & Science,
R.K.Nagar,
Chennai 600 021
URL: www.kcskasinadarcollege.in
E-mail: drdhanapal@gmail.com

Abstract
Sentiment analysis typically requires analyzing various parts of a text to produce appropriate results. Since text is in general unstructured, it is difficult for an algorithm to determine the result. This paper compares machine learning algorithms (Neural Networks and SVM) and the J48 classification algorithm to determine the best approach for finding the polarity of a document for sentiment analysis. The results indicate that SVM performs better than the other techniques in determining document polarity.
Keywords: Context based mining, Sentiment analysis, SVM, ANN, J48

1. Introduction
In Content Based Image Retrieval (CBIR), we concentrate on retrieving images that correspond to a query image. In conventional text based image search, users provide keywords, based on which images are retrieved. In text based search, the ability of the user to provide an exact query is limited by several factors: colour, texture and similar intricate details cannot be represented in textual form in a consistent manner. The inability to provide proper input therefore introduces bias or error into the output. For this reason, current generation image search takes images as input, so that the match can be much better than with text as input.
The drawback of the current approach is that the images are not searched in a single well defined context. The query image could be anything, and it must be matched against all other images in the repository before the output is provided. Image based search and matching has been successful in many domains that are context specific. For example, comparing iris scan images against a database containing only iris images has been very successful; similarly, facial recognition and fingerprint readers are very reliable because the images all come from a single well defined context.
For a broad category of images, the drawback of providing an image as input and searching for similar elements in a repository is that the user is handicapped because the context of the search is missing. For example, if the user provides the image of a dog and searches the repository, the context could be any of the following: pets, breed based search, police/sniffer dogs, trained dogs, helper dogs, diseases suffered by dogs, food for dogs, etc. By providing an image as input, the user is unable to specify the context that he/she is looking for in the image result.
The human way of looking at an image must be studied from a psychological point of view rather than being treated as just reading all the pixels and trying to make sense of them. Human vision, or the perception of human vision to be precise, is based on the overall broad context; once we obtain the context, we ignore the local details. This is completely different from a computerized program. Here semantics and context sensitivity play an important
role. This brings out the need to fill the semantic gap in content based image retrieval. Concentrating on the low level features alone makes the search results biased and error prone. Also, changes in luminance, texture or colour do not change the context of an image, and it is the context that we are looking for here.
The core concept in retrieving content from an image is currently pixel by pixel analysis of the image. But human vision does not give the same importance to all pixels as a computer does. So, in order to emulate human vision through computers, the key is semantics. To provide such a semantic based image retrieval system, the repository as well as the query image must be accompanied by some metadata. Metadata here provides the context; it could be keywords, descriptions and tags. Even sentiment polarity could be included to make the search much more effective and context sensitive. In this paper we try to bridge the semantic gap by including the sentiment polarity of the images in CBIR.
The remainder of this paper is structured as follows; section II provides

2. Related works
A lot of research has gone into content based image retrieval. Thijs Westerveld [1] used Latent Semantic Indexing to uncover hidden semantics. That work concentrates on including co-occurrence statistics to uncover the hidden semantic information, and tries to bring the best of both worlds, image features (content) and words (context), into one semantic space. Though the work showed better performance in mono- and multilingual text retrieval, its application to multi-modal and cross-modal image retrieval involves considerable computational complexity, and its subjectivity complicates the process further.
In [2], David et al. proposed several views on the importance of context sensitivity in image retrieval. They quote examples from newspapers that present text as well as images in a biased manner, favouring a particular political or religious faction. They introduce a new platform and a diversity engine architecture for image retrieval based on opinion analysis, text analysis and content based information retrieval. Though they stress the importance of semantics and context sensitivity in image retrieval, they provide only an overview and a summary of the existing text, image and other multimedia based retrieval systems.
In [3], Liyan et al. presented an approach that utilizes context information to learn adaptive rules for automatic and human-in-the-loop clustering. The work is somewhat more context aware, as it considers the particular domain of face tagging and detection. The repository considered in their work consists only of human facial images, and hence context sensitivity to a broader class is missing. Large scale context based retrieval of images requires analysis of millions or even billions of images and is hence computationally complex.
In [4], Thanh-Nghi Doan et al. proposed a parallel incremental methodology for power mean SVM based classification of large scale image datasets, which is shown to handle thousands of visual classes effectively. Such a parallel approach to context sensitive image retrieval could improve both performance and accuracy. It also deals with imbalanced data. In [5], David Ahlstrom et al. showed the effectiveness of simple and sophisticated tools for video exploration, providing insights from a real time video search competition.
The next step in web search is to include the user's sentiment/opinion effectively and hence provide context sensitive results. As suggested in [2], the importance of such sentiment analysis is on the rise, as text mining systems are now being integrated with multimedia based information retrieval systems. Search is therefore no longer based on text or images alone, but on a combination of them all, giving better results that are reliable in a wide variety of domains.
Several machine learning based methods have been proposed for lexical analysis of text corpora and for inferring sentiment polarity from them. In [6], Blinov et al. proposed a machine learning approach based on Support Vector Machines (SVM) and the maximum entropy method. Their approach includes information about the proportion of positive and negative words, their collocations and emoticons to better identify the context. However, it relies on manually built emotional dictionaries made specifically for each domain. Since such context based emotional dictionaries are not widely available for all domains, this is not a scalable solution for general web based image retrieval systems.
Automated text classification based on machine learning approaches has been practised for a long time. In [7], Ikonomakis et al. provided a detailed study of the state of the art in automated text classification using machine learning approaches. In [8], Stefano et al. presented SentiWordNet 3.0, the latest edition of a lexical resource specifically designed for opinion mining and sentiment classification applications. The difference between the
various versions of SentiWordNet and their features are also clearly explained, along with the research applications of such a lexical resource in automated text classification and sentiment polarity analysis. They also describe the algorithm for automatic WordNet annotation and how it effectively classifies text into positive, negative and neutral elements.
Rudy et al. [9] proposed a hybrid approach for sentiment analysis based on rule based classification, supervised learning and machine learning. They applied it to movie reviews and product reviews and reported effective classification of sentiment polarity. Though the results are comparatively good, the hybridization considerably increases the computational complexity of the approach. Bo Pang et al. [10] considered sentiment analysis based on positive and negative polarity alone, independent of topic. They used Naive Bayes, maximum entropy classification and support vector machines for sentiment analysis, and reported that machine learning approaches outperform human baselines for sentiment polarity classification.

3. System architecture
Figure 1: System Architecture
(Pipeline stages shown in the figure: Text; Content Analysis and Feature Vector Creation; Stop Word Elimination; Feature Matrix Creation; Context Based Sentiment Analysis using Machine Learning)

The process of context based image retrieval uses the base information available with the images to retrieve the context in which they are being used. The context based image retrieval system functions in four phases. The initial phase analyzes the available data and creates feature vectors, which are a broken down form of the available data. The second phase removes the stop words and symbols from the feature vectors, discarding unnecessary words and shortlisting the words needed for subsequent processing. After this refinement, the feature matrix is created from the reviews and the feature vectors. This matrix serves as the base for performing the context based sentiment analysis, in which machine learning is used to find the classification. Figure 1 shows the overall system architecture of the sentiment analysis methodology.

4. Context Based Image Retrieval Using Machine Learning Approaches


The term context refers to perspective or situation. Content retrieval using context as the key has its own complexities, the first and foremost being sentiment retrieval from the data. In general, context directly refers to the sentiments with which a certain text has been rendered. Emotion analysis is the next level of sentiment analysis: while sentiment analysis finds the polarity of the document (positive, negative or neutral), emotion analysis goes deeper and refers to the level of emotions. Our methodology classifies the images based on the polarity of the text, from which the context can be retrieved. The following four phases describe the working methodology of our system.

4.1. Content analysis and Feature Vector Creation


The content of an image can be derived directly from the structural elements of the image. Deriving the context from an image, however, is complex and mostly inaccurate. Hence it is necessary to look for other data that depict the context. This information is mostly found in the metadata and in content located in close proximity to the image. Metadata here refers to tags, descriptions or keywords corresponding to the image.
Hence the initial process in sentiment mining is content analysis and feature vector creation. The content present in the available information is analyzed and tokenized, and the word vector is created. Here, the word vector is referred to as the feature vector. This vector contains each word and its frequency of occurrence. After the completion of this phase, all the data corresponding to the text that is to be analyzed will be listed.
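
The tokenization and word counting described in this phase can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular expression, the lowercasing step and the function names `tokenize` and `feature_vector` are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # Words (letters, digits, apostrophes) or single non-space symbols.
    return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())


def feature_vector(text):
    """Map each token to its frequency of occurrence in the text."""
    return Counter(tokenize(text))


review = "A great film, with a great cast!"
vector = feature_vector(review)
print(vector.most_common(3))
```

Punctuation is kept as separate tokens here so that it can be discarded in the stop word elimination phase.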

4.2. Stop word elimination


Stop words are words that do not contribute to the meaning of a sentence; in short, they are connectors, articles or pronouns. The major contributors in the process of sentiment mining are the nouns, verbs, adverbs and adjectives that directly describe the activity taking place or determine the subject. All other words are mostly useless: they consume memory and reduce the processing speed. Punctuation marks such as the comma, full stop, colon, semicolon, question mark and exclamation mark are also treated as stop words.
The text considered for mining includes user provided unstructured data, that is, data without a proper format such as that of data from a database. Further, the text might not even form proper English sentences. It is very likely to contain colloquial forms of a language and it might even be multilingual. Our current methodology does not deal with multilingual data, but this could be addressed in future.
The process of stop word elimination uses the stop word collection of the storm project [12,13,14]. The feature vectors that were initially formed are filtered and the stop words occurring in them are eliminated. This removes a considerable amount of data from the main feature vector set, enabling faster computation.
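
The filtering step can be sketched as below. The stop word set shown is a small hypothetical stand-in chosen for illustration; it is not the collection from the storm project, and the function name `remove_stop_words` is likewise an assumption.

```python
import string
from collections import Counter

# Hypothetical, abbreviated stop word list for illustration only; the paper
# uses the collection shipped with the storm project, which is not reproduced here.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "with", "of", "it", "is", "this", "that"}
PUNCTUATION = set(string.punctuation)


def remove_stop_words(vector):
    """Return a copy of the feature vector without stop words and punctuation."""
    return Counter({
        token: count
        for token, count in vector.items()
        if token not in STOP_WORDS and token not in PUNCTUATION
    })


raw = Counter({"a": 2, "great": 2, "film": 1, ",": 1, "with": 1, "cast": 1, "!": 1})
filtered = remove_stop_words(raw)
print(sorted(filtered))
```

In this toy vector four of the seven distinct tokens are dropped, which illustrates how much the feature set can shrink before any learning takes place.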

4.3. Feature matrix creation


The next phase is the creation of the feature matrix. This phase maps the content onto the already defined feature vectors and creates an n × m matrix, where n is the number of texts considered for evaluation and m is the number of items in the feature vector.


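$$A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} \tag{1}$$

$$a_{ij} = \begin{cases} 1 & \text{if } word_j \in review_i \\ 0 & \text{otherwise} \end{cases} \tag{2}$$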

Equation (1) shows a sample feature matrix, while equation (2) shows the condition for populating it.
From equation (1) it can be seen that each row of the matrix refers to a text and each column refers to a word from the feature vector. The matrix is populated in such a way that if the word occurs in the given text, an entry of 1 is added to the matrix, and if the word is not present in the given text, an entry of 0 is added.
The feature matrix is generally large and is used as the base for the machine learning algorithms.
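
The population rule in equation (2) can be sketched as follows, with one row per text and one column per word. The sample reviews and the helper name `build_feature_matrix` are illustrative assumptions, and the reviews are taken to be already tokenized and filtered.

```python
def build_feature_matrix(texts, vocabulary):
    """Return an n x m binary matrix: rows are texts, columns are vocabulary words."""
    matrix = []
    for tokens in texts:
        present = set(tokens)
        # Entry is 1 if the word occurs in this text, 0 otherwise.
        matrix.append([1 if word in present else 0 for word in vocabulary])
    return matrix


reviews = [
    ["great", "film", "great", "cast"],
    ["boring", "film"],
    ["great", "plot"],
]
vocabulary = sorted({word for tokens in reviews for word in tokens})
matrix = build_feature_matrix(reviews, vocabulary)

for row in matrix:
    print(row)
```

Because the entries record presence rather than frequency, the repeated word "great" in the first review still yields a single 1.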

4.4. Context based Sentiment Analysis using Machine Learning Algorithms


After the preprocessing and data preparation phases, the data is ready for sentiment analysis. Given the nature of the problem, we consider machine learning algorithms best suited to sentiment analysis. For a supervised machine learning system to work well, it must be provided with appropriate training and test data. The discussion here is based mainly on supervised learning, because the nature of the problem demands labeling of terms so that they can be used in future classifications; unsupervised methods might not work efficiently without any training. Both the training and the test data are labeled with their corresponding classes and provided to the machine learning system.

5. Results and discussion


The data set used is the movie review data from [15]. The base form of this data was used in [16] for polarity classification. This domain is experimentally convenient because reviews provide a large amount of text, and the review text as a whole describes the overall intention of the user, which makes it efficient data for classification. The original source of this data was the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup at [17]. The reviews are categorized as positive or negative and are stored separately as training and test corpora.
The comparison focuses on machine learning approaches (Neural Networks and SVM) and the J48 classification algorithm.

Figure 2: Result of J48

Figure 2 shows the result obtained from the J48 Classifier.


Figure 3: ROC for J48 (Positive Sentiment); axes: TPR versus FPR

Figure 3 shows the ROC plot for the positive sentiment. From the curve, it can be observed that the accuracy is approximately 50%. As J48 is a primitive classifier, the result obtained is only average; hence we conclude that a machine learning approach would be a better option.
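
For readers unfamiliar with ROC plots, the sketch below shows how the TPR and FPR coordinates of such a curve are obtained from classifier scores, and how the area under the curve is computed. It is a generic illustration with made-up labels and scores, not the paper's data; it assumes distinct scores and binary labels with 1 for positive sentiment. A curve close to the diagonal has an area near 0.5, which corresponds to chance-level ranking.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit examples from the highest score to the lowest.
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Integrate the ROC curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 0, 1, 0]          # toy ground truth
scores = [0.9, 0.8, 0.4, 0.3]  # toy classifier scores
points = roc_points(labels, scores)
print(points, area_under_curve(points))
```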

Figure 4: Result of ANN

Figure 4 shows the working of the neural network model. Due to the continual training approach and the very large data size, the training time of the neural network is very high. Further, the error rate is also high. It can be observed from Figure 4 that the error rate is 2.133 and that the error reduction rate is also very low. Hence the option of using neural networks is eliminated. The ENCOG framework is used for
implementing the neural network model. The neural network was constructed with three layers. The input and output layers have no bias neurons, while the processing layer has two bias neurons. The input layer was sized according to the number of words obtained after pre-processing, which in our case is 3190. The ActivationLinear function was used in the input layer and the ActivationTanH function in the processing and output layers. Resilient propagation was used to train the network. The network design is as follows (Table 1):
Table 1: Neural Network Setup

Parameter                                        Value
No. of layers                                    3
No. of neurons in input layer                    3190
No. of bias neurons in input layer               0
No. of neurons in processing layer               3192
No. of bias neurons in processing layer          2
No. of neurons in output layer                   -
No. of bias neurons in output layer              0
Activation function used in input layer          ActivationLinear
Activation function used in processing layer     ActivationTanH
Activation function used in output layer         ActivationTanH
Neural network training function                 Resilient Propagation
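
To make the layer arrangement concrete, the sketch below runs a single forward pass through a three-layer network with a linear input layer and tanh processing and output layers. It is plain Python and does not use the ENCOG API. The toy layer sizes, the random weights, the single bias unit feeding the output layer and the function name `forward` are all assumptions for illustration, and resilient propagation training is not shown.

```python
import math
import random


def forward(x, w_hidden, w_output):
    """Forward pass: linear input layer, tanh processing layer, tanh output layer."""
    inputs = list(x)  # linear activation passes the inputs through unchanged
    hidden = [math.tanh(sum(w * v for w, v in zip(row, inputs))) for row in w_hidden]
    hidden_with_bias = hidden + [1.0]  # assumed: one bias unit on the processing layer
    return [math.tanh(sum(w * v for w, v in zip(row, hidden_with_bias))) for row in w_output]


random.seed(0)
n_inputs, n_hidden, n_outputs = 4, 3, 1  # toy sizes; the paper uses 3190 inputs
w_hidden = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
w_output = [[random.uniform(-1, 1) for _ in range(n_hidden + 1)] for _ in range(n_outputs)]

output = forward([1, 0, 1, 1], w_hidden, w_output)
print(output)
```

With a 3190 by 3192 weight block between the first two layers, each training iteration touches roughly ten million weights, which is consistent with the long training time reported.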

The same data set is then analyzed using SVM, with the RBF kernel function used for classification.

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2 + r\right), \quad \gamma > 0 \tag{3}$$
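
The kernel in equation (3) can be evaluated directly. The sketch below assumes the standard RBF form, with the offset term r taken as 0 and a hypothetical default of gamma = 0.5; the paper does not report the parameter values used.

```python
import math


def rbf_kernel(x_i, x_j, gamma=0.5):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1, 0, 1], [1, 0, 1])
different = rbf_kernel([1, 0], [0, 1])
print(same, different)
```

Identical feature rows give a kernel value of 1, and the value decays towards 0 as the rows differ in more positions.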

The SVM requires a special format for reading the data. The expected format of input for an SVM is
[label] [index1]:[value1] [index2]:[value2] ...    (4)

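The values (value1, value2, ..., valuen) in the given format are normalized to the range -1 to 1. To convert the data into the required format, Max-Min Normalization is used, which is of the form

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \tag{5}$$

where min_A and max_A are the smallest and largest observed values of attribute A, and new_min_A = -1 and new_max_A = 1 bound the target range.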
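
The normalization and formatting steps can be sketched together as follows. This assumes the standard min-max formula with a target range of -1 to 1, one-based feature indices, and that every feature is written out; the toy counts, the labels and the function names `min_max_scale` and `to_svm_line` are illustrative assumptions.

```python
def min_max_scale(column, new_min=-1.0, new_max=1.0):
    """Rescale a list of values linearly to the range [new_min, new_max]."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no information; map it to the midpoint.
        return [(new_min + new_max) / 2 for _ in column]
    return [(v - low) / (high - low) * (new_max - new_min) + new_min for v in column]


def to_svm_line(label, values):
    """Format one example as '[label] [index1]:[value1] [index2]:[value2] ...'."""
    features = " ".join(f"{i}:{v:g}" for i, v in enumerate(values, start=1))
    return f"{label} {features}"


counts = [[0, 2, 4], [1, 0, 4], [2, 1, 0]]  # toy rows: one per review
labels = [1, -1, 1]                         # toy polarity labels

# Scale each column (feature) independently, then transpose back to rows.
columns = [min_max_scale(list(col)) for col in zip(*counts)]
rows = [list(row) for row in zip(*columns)]
lines = [to_svm_line(label, row) for label, row in zip(labels, rows)]
print("\n".join(lines))
```

Scaling is applied per feature column rather than per row, so that each attribute spans the same range across all examples.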

Sample input data for the SVM is shown in Figure 5.



Figure 5: Sample input data for SVM

Figure 6: ROC for SVM; axes: TPR versus FPR

Figure 6 shows the ROC plot, which indicates promising accuracy. Hence, after analysis of the results, SVM is found to work efficiently for the process of context mining. Figure 7 shows the result obtained from the SVM classifier.

Figure 7: Result of SVM

6. Conclusion
This paper is an initial implementation that analyzes the available data with the classification algorithms in order to select the appropriate technique for the next level of analysis. The implementation uses data obtained from the IMDb dataset, and the results show that SVM works best in the area of context mining. This process can be further improved by using one-class classification techniques rather than multi-class classification. Further, our next research proposal will take this research forward into mining levels of polarity rather than
providing a single polarity base. The level of polarity can be analyzed and used for performing emotion analysis, which is a deeper form of sentiment analysis.

7. References
[1] Thijs Westerveld, (2000), Image Retrieval: Content versus Context, University of Twente, Department of Computer Science, Parlevink Group, PO Box 217, 7500 AE Enschede, The Netherlands.

[2] David Paul Dupplaw, Michael Matthews, Richard Johansson, Giulia Boato, Andrea Costanzo, Marco Fontani, Enrico Minack, Elena Demidova, Roi Blanco, Thomas Griffiths, Paul Lewis, Jonathon Hare, Alessandro Moschitti, (2014), Information extraction from multimedia web documents: an open-source platform and testbed, Int J Multimed Info Retr 3:97-111.

[3] Liyan Zhang, Dmitri V. Kalashnikov, Sharad Mehrotra, (2014), Context Assisted Face Clustering Framework with Human-in-the-Loop, International Journal of Multimedia Information Retrieval, Volume 3, Issue 2, pp 69-88.

[4] Thanh-Nghi Doan, Thanh-Nghi Do, Francois Poulet, (2014), Parallel Incremental Power Mean SVM for the Classification of Large Scale Image Datasets, International Journal of Multimedia Information Retrieval, Volume 3, Issue 2, pp 89-96.

[5] Klaus Schoeffmann, David Ahlstrom, Werner Bailer, Claudiu Cobarzan, Frank Hopfgartner, Kevin McGuinness, Cathal Gurrin, Christian Frisson, Duy-Dinh Le, Manfred Del Fabro, Hongliang Bai, Wolfgang Weiss, (2014), The Video Browser Showdown: A Live Evaluation of Interactive Video Search Tools, International Journal of Multimedia Information Retrieval, Volume 3, Issue 2, pp 113-127.

[6] Blinov P. D., Klekovkina M. V., Kotelnikov E. V., Pestov O. A. (2013), Research of lexical approach and machine learning methods for sentiment analysis.

[7] M. Ikonomakis, S. Kotsiantis, V. Tampakas, (2005), Text Classification Using Machine Learning Techniques, Wseas Transactions On Computers, Issue 8, Volume 4, pp. 966-974.

[8] Stefano Baccianella, Andrea Esuli, Fabrizio Sebastiani, (2010), SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining, LREC. Vol. 10.

[9] Rudy Prabowo, Mike Thelwall, (2009), Sentiment Analysis: A Combined Approach, Journal of Informetrics 3.2: 143-157.

[10] Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, (2002), Thumbs up? Sentiment Classification using Machine Learning Techniques, Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10.

[11] Rajaraman, A.; Ullman, J. D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1-17. doi:10.1017/CBO9781139058452.002. ISBN 9781139058452.

[12] http://storm-project.net, Referred on: 3 Oct 2014.

[13] https://github.com/nathanmarz/storm, Referred on: 3 Oct 2014.

[14] https://github.com/nathanmarz/storm/wiki, Referred on: 3 Oct 2014.

[15] http://www.cs.cornell.edu/People/pabo/movie-review-data, Referred on: 3 Oct 2014.

[16] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. (2002), "Thumbs up? Sentiment classification using machine learning techniques." Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10.

[17] http://reviews.imdb.com/Reviews, Referred on: 3 Oct 2014.