Sentiment Analysis

Data

CHAPTER 1
INTRODUCTION
The amount of raw data stored in corporate databases is exploding. In
today's fiercely competitive business environment, companies need to
rapidly turn these raw data into significant insights. Data mining, or
knowledge discovery, is the computer-assisted process of digging through
and analyzing enormous sets of data and then extracting the meaning of the
data. Data mining tools predict behaviors and future trends, allowing
businesses to make proactive, knowledge-driven decisions.
The analysis of large volumes of data with data mining methods is generally regarded as a field
for specialists. They build more or less complex analysis processes, often with very expensive
software solutions, for example to predict imminent contract cancellations or the sales figures
of a product. The economic benefit is obvious, and so it was long assumed that using data mining
software necessarily meant high license costs, plus the support that the complexity of the
subject matter often makes necessary.
Since the development of the open source software RapidMiner, at the latest, nobody can
seriously maintain that software solutions for data mining have to be expensive or
difficult to use.
Several open source data mining tools are available, such as Weka, Orange, Rattle and KNIME.
Within this group, RapidMiner stands out for its efficient performance. Today
RapidMiner is the world's leading open source data mining solution thanks to the combination
of its leading-edge technologies and its functional range. Applications of RapidMiner cover a
wide range of real-world data mining tasks.

1.1 The Tool


RapidMiner is licensed under the GNU Affero General Public License version 3 and is
currently available in version 5.3. RapidMiner contains more than 500 operators for
all tasks of professional data analysis, i.e. operators for input and output as well as data
processing (ETL), modeling and other aspects of data mining. Methods for text mining,
web mining, automatic sentiment analysis of Internet discussion forums (opinion mining)
and time series analysis and prediction are also available to the analyst.
In addition, RapidMiner contains more than 20 methods for visualizing high-dimensional data
and models. Moreover, all learning methods and weighting schemes of the Weka toolbox have
been fully and seamlessly integrated into RapidMiner, so that the complete range
of functions of Weka, which is widely used in research at the moment, is added to the
already enormous range of functions of RapidMiner.

CHAPTER 2

2.2 Installation
Download the appropriate installation package for your operating system and install
RapidMiner according to the instructions on the website. All usual Windows versions are
supported as well as Macintosh, Linux or UNIX systems. Download is available from
http://www.rapid-i.com.

2.3 Perspective and views


When you open RapidMiner, you will be welcomed into the so-called Welcome Perspective. The upper
section shows typical actions which you as an analyst will perform frequently after starting
RapidMiner. Here are the details of these:
1. New: Starts a new analysis process. First you must define a location and a name within the
process and data repository and then you will be able to start designing a new process.
2. Open Recent: Opens the process which is selected in the list below the actions. Alternatively,
you can also open this process by double-clicking inside the list. Either way, RapidMiner will
then automatically switch to the Design Perspective.
3. Open: Opens the repository browser and allows you to select a process to be opened within
the process Design Perspective.
4. Open Template: Shows a selection of different pre-defined analysis processes, which can be
configured in a few clicks.
5. Online Tutorial: Starts a tutorial which can be used directly within RapidMiner and gives an
introduction to some data mining concepts using a selection of analysis processes.

Figure 1: Welcome Perspective of RapidMiner.


We will find an icon for each perspective within the right-hand area of the toolbar:

Figure 2: Toolbar Icons for Perspectives

The icons shown here take you to the following perspectives:


1. Design Perspective: This is the central RapidMiner perspective where all analysis processes
are created and managed.
2. Result Perspective: If a process supplies results in the form of data or models then
RapidMiner takes you to this Result Perspective, where you can look at several results at the
same time.
3. Welcome Perspective: The Welcome Perspective already described above, which RapidMiner
welcomes you with after starting the program.
You can switch to the desired perspective by clicking inside the toolbar or alternatively
via the menu entry "View" - "Perspectives", followed by the selection of the target perspective.
RapidMiner may also ask you automatically whether to switch to another perspective when this
seems a good idea, e.g. to the Result Perspective on completing an analysis process.

Design Perspective
Since the Design Perspective is the central working environment of RapidMiner, we
discuss all of its parts separately below, together with the fundamental functionality
of the associated views. There are two central views in this area,
at least in the standard setting.

Figure 3: Design Perspective of RapidMiner

Operators View
All work steps (operators) available in RapidMiner are presented in groups here and can
be included in the current process from this view. You can navigate within the groups easily
and browse the operators provided to your heart's content. If RapidMiner has been
extended with one of the available extensions, then the additional operators can also be found
here.

Without extensions you will find at least the following groups of operators in the tree
structure:
1. Process Control: Operators such as loops or conditional branches which can control the
process flow.
2. Utility: Auxiliary operators which, alongside the operator "Subprocess" for grouping
subprocesses, also contain the important macro-operators as well as the operators for
logging.
3. Repository Access: Contains the two operators for read and write access in repositories.
4. Import: Contains a large number of operators in order to read data and objects from
external formats such as files, databases etc.
5. Export: Contains a large number of operators for writing data and objects into external
formats such as files, databases etc.
6. Data Transformation: Probably the most important group in the analysis in terms of size
and relevance. All operators for transforming both data and meta data are located here.
7. Modeling: Contains the actual data mining methods such as classification methods,
regression methods, clustering, weightings, methods for association rules, correlation and
similarity analyses, as well as operators for applying the generated models to new
data sets.
8. Evaluation: Operators for estimating the quality of a model and thus its expected
performance on new data, e.g. cross-validation, bootstrapping etc.
You can select operators within the Operators View and add them in the desired place in the
process by simply dragging and dropping.

Repositories View
The repository is a central component of RapidMiner which was introduced in Version 5.
It serves to manage and structure your analysis processes into projects and, at the
same time, acts as a source of both data and the associated meta data.

Process View
The Process View shows the individual steps within the analysis process as well as their
interconnections.

Inserting Operators
You can insert new operators into the process in different ways. Here are the details of the
different ways:
1. Via drag & drop from the Operators View as described above,
2. Via double click on an operator in the Operators View,
3. Via dialog which is opened by means of the first icon in the toolbar of the Process View,
4. Via dialog which is opened by means of the menu entry "Edit" - "New Operator..." (CTRL-I),
5. Via context menu in a free area of the white process area and there via the submenu "New
Operator" and the selection of an operator.

Parameters View
Numerous operators require one or several parameters to be specified in order to function
correctly. For example, operators that read data from files require the file path to be
specified. Note that some parameters are only displayed when other parameters have a certain
value. For example, an absolute number of desired examples can only be specified for the
operator "sampling" when "absolute" has been selected as the type of sampling.

Help and Comment View


Each time you select an operator in the Operators View or in the Process View, the help
window within the Help View shows a description of this operator. These descriptions include
1. A short synopsis which summarizes the function of the operator in one or a few sentences,
2. A detailed description of the functionality of the operator,
3. A list of all parameters including a short description of the parameter, the default value (if
available), the indication as to whether this parameter is an expert parameter as well as an
indication of parameter dependencies.

Comment View
Unlike Help, the Comment View is not dedicated to pre-defined descriptions but rather to
your own comments on individual steps of the process. Simply select an operator and write any
text on it in the comment field. This will then be saved together with your process definition and
can be useful for tracing individual steps in the design later on.

Problems and Log View


A further central element and valuable source of help during the design of your
analysis processes is the Problems View. Any warnings and error messages are clearly listed
in a table here. In the first column, named "Message", you will find a short summary of the
problem. The last column, named "Location", shows you where the problem arises, in the
form of the operator name and the name of the input port concerned. A considerable innovation
of RapidMiner 5, however, is that it can also suggest solutions for such problems and
apply them directly. These solutions are called quick fixes. The second column
gives an overview of such possible solutions, either directly as text if there is only one possible
solution or as an indication of how many different possibilities exist to solve the problem.

Log View
During the design, and in particular during the execution of processes, numerous
messages are written at the same time and can provide information, particularly in the event of
an error, as to how the error can be eliminated by a changed process design. You can copy the
text within the Log View as usual and process it further in other applications. You can also save
the text in a file, delete the entire contents or search through the text using the actions in the
toolbar.

CHAPTER 3
SYSTEM REQUIREMENTS
Hardware Requirements:

Processor    : Pentium 4
Memory Size  : 1 GB RAM
Storage      : 80 GB Hard Disk
Display      : EGA/VGA Color Monitor, 600x800 Pixels Resolution, High Color (16 Bit)
Keyboard     : Any with minimum required Keys

Software Requirements:

Operating System : Windows XP and above, Linux, Mac
Java             : Java SE 1.6 and above

CHAPTER 4
Data Sentiment Analysis with RapidMiner
Sentiment analysis, or opinion mining, is an application of text analytics that identifies and
extracts subjective information from source materials.
A basic task in sentiment analysis is classifying the opinion expressed in a document, a sentence
or an entity feature as positive or negative.
The example presented here takes a list of movies and classifies the reviews of each as Positive
or Negative. The process is evaluated using precision and recall. Precision is the probability that a
(randomly selected) retrieved document is relevant. Recall is the probability that a (randomly
selected) relevant document is retrieved in a search. In other words, high recall means that an
algorithm returned most of the relevant results. High precision means that an algorithm returned more
relevant results than irrelevant ones.
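
The two measures can be made concrete with a few lines of Python. This is only a minimal sketch: the function name `precision_recall` and the six labels are made up for illustration and are not taken from the RapidMiner process described here.

```python
def precision_recall(actual, predicted, positive="Positive"):
    """Return (precision, recall) for one class, given parallel label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical true and predicted classes for six reviews.
actual = ["Positive", "Positive", "Positive", "Negative", "Negative", "Negative"]
predicted = ["Positive", "Positive", "Negative", "Positive", "Negative", "Negative"]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here two of the three reviews predicted as Positive really are positive, and two of the three truly positive reviews were found, so both values come out as 2/3.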
At first, both positive and negative reviews of a certain movie are taken. All of the words are
stemmed into root words. Then the words are stored under their polarity (positive or negative).
Both a vector wordlist and a model are created. Then the required list of movies is given as
input. The model compares every word from the reviews of the given movies with the words
stored earlier under each polarity. The movie review is classified according to the
polarity under which the majority of its words occur.
For example, when you look at Django Unchained, the reviews are compared with the vector
wordlist created at the beginning. The highest number of words falls under positive polarity, so
the outcome is Positive. The same applies for a Negative outcome.
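
The majority idea described above can be sketched in Python as follows. This is a deliberately simplified illustration and not what RapidMiner does internally (the process below trains an SVM on word vectors). The helper names `build_wordlist` and `classify` and the two training sentences are assumptions made for the example, and stemming is left out.

```python
from collections import Counter


def build_wordlist(labeled_reviews):
    """Map each polarity to a Counter of the words seen in its training reviews."""
    wordlist = {}
    for polarity, text in labeled_reviews:
        wordlist.setdefault(polarity, Counter()).update(text.lower().split())
    return wordlist


def classify(review, wordlist):
    """Return the polarity matching the most words of the review, plus all scores."""
    words = review.lower().split()
    scores = {
        polarity: sum(1 for w in words if w in counts)
        for polarity, counts in wordlist.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical training reviews, one per polarity.
training = [
    ("Positive", "brilliant gripping story with superb acting"),
    ("Negative", "dull boring story with weak acting"),
]
wordlist = build_wordlist(training)

label, scores = classify("a gripping and brilliant film with superb pacing", wordlist)
print(label, scores)
```

Four words of the new review occur in the positive wordlist and only one in the negative wordlist, so the review is labeled Positive.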

The first step in implementing this analysis is processing the documents, i.e. extracting the
positive and negative reviews of a movie and storing them under different polarities.
The model is shown in Figure 1.

Figure 1.

In the Process Document operator, click on Edit List on the right. Load the positive and negative
reviews under the class names "Positive" and "Negative", as shown in Figure 2.

Figure 2.

Inside the Process Document operator, several nested operations take place: tokenizing the words,
filtering the stop words, stemming the words into root words and filtering the tokens to those
between 4 and 25 characters long, as shown in Figure 3.

Figure 3.
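
The four nested steps can be imitated in plain Python to see what they do to a sentence. This is a rough sketch under stated assumptions: the stop word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer, not the algorithm RapidMiner uses.

```python
import re

# Tiny illustrative stop word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "was", "it", "this", "that", "in", "to"}


def naive_stem(token):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, min_len=4, max_len=25):
    """Tokenize, drop stop words, stem, then keep tokens of min_len..max_len characters."""
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize on non-letters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # filter stop words
    tokens = [naive_stem(t) for t in tokens]             # stem to (rough) root words
    return [t for t in tokens if min_len <= len(t) <= max_len]  # filter by length


tokens = preprocess("The acting was amazing and the plot is gripping")
print(tokens)
```

Note that the length filter runs after stemming, as in the order listed above, so a word such as "acting" is first reduced to "act" and then dropped for being shorter than 4 characters.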

Then two further operators are used, Store and Validation, as shown in Figure 1. The Store
operator saves the word vector to a location of our choosing. The Validation operator
performs cross-validation, a standard way to assess the accuracy and validity of a statistical model.
The data set is divided into two parts, a training set and a test set. The model is trained on the
training set only and its accuracy is evaluated on the test set. This is repeated n times.
Double-click on the Validation operator. There are two panels, Training and Testing. In the Training
panel, a linear Support Vector Machine (SVM) is used, a popular classifier whose decision
function is a linear combination of all the input variables. In order to test the model, we use the
Apply Model operator to apply the trained model to our test set. To measure the model accuracy
we use the Performance operator.
The operations under Validation are shown in Figure 4.

Figure 4.
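
The train/test loop behind cross-validation can be sketched in Python as follows. This is a minimal illustration, not the Validation operator itself: the function names are hypothetical, the fold assignment is a simple round-robin split, and a trivial majority-class learner stands in for the linear SVM so that the example stays self-contained. It assumes k is not larger than the number of examples.

```python
from collections import Counter


def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(examples, labels, k, train_fn, predict_fn):
    """Average accuracy over k folds; each model only sees its training part."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, examples[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k


def train_majority(train_examples, train_labels):
    """Stand-in learner: remember the most frequent training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, example):
    """Stand-in prediction: always answer with the remembered label."""
    return model


# Hypothetical data set: ten reviews, seven positive and three negative.
examples = list(range(10))
labels = ["Positive"] * 7 + ["Negative"] * 3

accuracy = cross_validate(examples, labels, 5, train_majority, predict_majority)
print(f"mean accuracy over 5 folds: {accuracy:.2f}")
```

Because every example lands in the test part exactly once, the averaged accuracy uses all of the data for testing without ever testing a model on examples it was trained on.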

Then run the model. The resulting class recall and precision percentages are shown in Figure 5. The
model and vector wordlist are stored in a Repository.

Figure 5.

Then retrieve both the model and the vector wordlist from the Repository in which you stored them
earlier. Connect the output of the Retrieve operator for the wordlist to the Process Document
operator, as shown in Figure 6. The operations inside Process Document are the same as those
shown in Figure 3.

Figure 6.

Then click on the Process Document operator and click Edit List on the right. This time I
have added a list of 5 movie reviews from the Rotten Tomatoes website, stored in
a directory. Assign the class name Unlabeled, as shown in Figure 7.

Figure 7.

The Apply Model operator takes the model from a Retrieve operator and the unlabeled data from
the Process Document operator as input and delivers the labeled data at its lab port, so connect
that port to the res (results) port. The result is shown below. For Les Miserables, for example,
the model is 86.4% confident that the review is positive and 13.6% that it is negative, because
the reviews match the wordlist under positive polarity more closely than the one under negative
polarity.

Figure 8.

CHAPTER 5
COMPARISON
Procedure: Partitioning of dataset into training and testing sets
KNIME      : Pass (but limited partitioning methods)
RapidMiner : Pass (but limited partitioning methods)
Weka       : Pass (but limited partitioning methods)
TANAGRA    : Pass (but limited partitioning methods)
Orange     : Pass (but limited partitioning methods)

Procedure: Descriptor scaling
KNIME      : Pass
RapidMiner : Pass
Weka       : Fail (cannot save parameters for scaling to apply to future datasets)
TANAGRA    : Fail (cannot save parameters for scaling to apply to future datasets)
Orange     : Fail (no scaling methods)

Procedure: Descriptor selection
KNIME      : Fail (no wrapper methods)
RapidMiner : Pass
Weka       : Pass (but is not part of KnowledgeFlow)
TANAGRA    : Fail (wrapper methods valid for logistic regression only)
Orange     : Fail (no wrapper methods)

Procedure: Parameter optimization of machine learning/statistical methods
KNIME      : Fail (not automatic)
RapidMiner : Pass
Weka       : Fail (not automatic)
TANAGRA    : Fail (not automatic)
Orange     : Fail (not automatic)

Procedure: Model validation using cross-validation and/or independent validation set
KNIME      : Pass (but limited error measurement methods)
RapidMiner : Pass
Weka       : Pass (but cannot save model so have to rebuild model for every future dataset)
TANAGRA    : Pass (but cannot save model so have to rebuild model for every future dataset)
Orange     : Fail (cannot validate independent validation set)

Table 1.

CHAPTER 6
ADVANTAGES AND DISADVANTAGES
Advantages

The free version offers enough functionality for a small business to avoid the big-name options

It is a quality tool, as shown by its ranking alongside commercial products

The GUI is very user friendly. The GUI is used to create data mining processes, which are stored in XML files

XML standardization makes it easy to use various data sources

Ease of use and available tutorials

Works on all major operating systems

Disadvantages

Some options are not available in the free product, but you can upgrade

Possibly less customer service is available for the free version

There can be some restrictions on customized use

Beginners may face some difficulty in understanding the tool
