
2.1 Text Mining
Text mining refers to the process of deriving high-quality information from text. It describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. Text mining is a variation of data mining that tries to find interesting patterns in databases. As most information is currently stored as text, text mining is believed to have high commercial potential value. Text analysis processes typically include:
a. Information retrieval or identification of a corpus. This step includes collecting or identifying a set of textual materials on the web or in a file system, database, or content management system;
b. Applying natural language processing, such as part-of-speech tagging, syntactic parsing, and other types of linguistic analysis;
c. Named entity recognition to identify named text features: people, organizations, place names, and so on, using statistical techniques;
d. Recognition of pattern-identified entities. Features such as telephone numbers, email addresses, and quantities can be discerned with regular expressions (see the sketch after this list);
e. Coreference resolution: identification of noun phrases and other terms that refer to the same object;
f. Identification of associations among entities and other information in text;
g. Sentiment analysis, which involves discerning subjective material and extracting various forms of attitudinal information, such as opinion, mood, and emotion;
h. Quantitative text analysis, a set of techniques stemming from the social sciences for finding the meaning or stylistic patterns of a casual personal text.
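As an illustration of item d, the following minimal sketch uses base R regular expressions to pull email addresses and telephone numbers out of free text. The sample strings and the patterns themselves are illustrative assumptions, not part of the original source.

# Pattern-based entity recognition with base R regular expressions
texts <- c("Call +1-555-0123 or mail jane.doe@example.com for details.",
           "Order 250 units; contact sales@example.org.")

email_pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
phone_pattern <- "\\+?[0-9][0-9() -]{6,}[0-9]"

# gregexpr() finds every match; regmatches() extracts the matched substrings
emails <- regmatches(texts, gregexpr(email_pattern, texts))
phones <- regmatches(texts, gregexpr(phone_pattern, texts))
print(emails)  # email addresses found in each text
print(phones)  # phone-number-like strings found in each text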
Text mining is now broadly applied in various fields, including security, biomedicine, software applications, sentiment analysis, marketing, and academic research.
2.2 Sentiment Analysis
Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information in source materials. The basic task in sentiment analysis is classifying the polarity of a text: whether it is positive, negative, or neutral. Sentiment analysis can be split into two separate categories: manual (human) sentiment analysis and automated sentiment analysis [22]. The differences lie in the efficiency of the system and the accuracy of the analysis. A human analysis component is required in sentiment analysis, as automated systems are not able to take into account the historical tendencies of an individual commenter or of the platform, and therefore often classify the expressed sentiment incorrectly.
There are two main techniques for sentiment analysis: machine learning based and lexicon based [20]. In machine learning based techniques, two sets of documents are needed: a training set and a test set. The training set is used by an automatic classifier to learn the differentiating characteristics of documents, while the test set is used to check how well the classifier performs. Machine learning starts with collecting the training dataset; the next step is training a classifier on the training data. Once the technique is selected, an important decision to make is feature selection, which determines how the documents are represented.
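As a minimal sketch of this workflow, the following example builds a document-term matrix with the tm package, splits a tiny hand-labeled dataset (an illustrative assumption) into training and test sets, and trains a simple nearest-centroid classifier; this stands in for whatever classifier a real system would use, and a real system would also build the vocabulary from the training set only.

library(tm)

texts  <- c("great product, love it", "awful service, very bad",
            "excellent and helpful",  "terrible, would not buy",
            "good value, happy",      "bad quality, disappointed")
labels <- c("pos", "neg", "pos", "neg", "pos", "neg")

# Feature selection: raw term counts in a document-term matrix
corpus <- VCorpus(VectorSource(texts))
dtm <- as.matrix(DocumentTermMatrix(corpus,
                                    control = list(removePunctuation = TRUE)))

train <- 1:4   # documents the classifier learns from
test  <- 5:6   # documents used to check performance

# "Training": average term frequencies per class (a nearest-centroid model)
centroids <- rbind(
  pos = colMeans(dtm[train, ][labels[train] == "pos", , drop = FALSE]),
  neg = colMeans(dtm[train, ][labels[train] == "neg", , drop = FALSE]))

# Classify each test document by its closer class centroid
for (i in test) {
  dist <- apply(centroids, 1, function(cen) sqrt(sum((dtm[i, ] - cen)^2)))
  cat(texts[i], "->", names(which.min(dist)), "\n")
}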
In the lexicon based technique, the classification is done by comparing the features of a given text against sentiment lexicons whose sentiment values are determined prior to use. A sentiment lexicon contains lists of words and expressions used to express people's subjective feelings and opinions. For example, with positive and negative lexicons, the document is analyzed by counting the lexicon words it contains: if the document has more positive lexicon words, it is classified as positive, and vice versa. The lexicon based technique is unsupervised learning because it does not require prior training in order to classify data.
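The following minimal sketch illustrates the idea; the two tiny lexicons and the sample sentences are illustrative assumptions, as real systems use lexicons with thousands of entries.

# Lexicon-based polarity classification by counting lexicon words
positive_lexicon <- c("good", "great", "excellent", "happy", "love")
negative_lexicon <- c("bad", "terrible", "awful", "sad", "hate")

classify_sentiment <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  pos <- sum(words %in% positive_lexicon)
  neg <- sum(words %in% negative_lexicon)
  if (pos > neg) "positive" else if (neg > pos) "negative" else "neutral"
}

classify_sentiment("The movie was great, I love it")  # "positive"
classify_sentiment("Terrible plot and bad acting")    # "negative"

Note that no training step is needed: the lexicons alone drive the classification.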

2.3 Related Works


The lexicon-based method has been implemented in sentiment analysis before. Taboada [16] used a lexicon-based method to develop the Semantic Orientation CALculator (SO-CAL), which is applied to sentiment analysis of blog postings and video game reviews. The conclusion is that lexicon-based methods for sentiment analysis are robust, give good cross-domain performance, and can be easily enhanced with multiple sources of knowledge. Palanisamy, Yadav, and Elchuri [14] detected sentiments on Twitter using a lexicon built from the Serendio taxonomy, which consists of positive, negative, negation, and stop words and phrases. The system yields an F-score of 0.8004 on the test dataset.
Implementation of sentiment analysis in various languages has been done by Cui and Garcia. Cui [7] applied the lexicon-based method to sentiment analysis for a Chinese microblog named Weibo, and met some difficulties because Weibo messages usually have imbalanced sentiment polarities. Garcia, Gaines, and Linaza [10] applied the lexicon-based method to sentiment analysis of online reviews in Spanish. A preliminary evaluation of the proposed approach was conducted on two real datasets of Spanish reviews from TripAdvisor.com, related to accommodation and to food and beverage. Among the preliminary conclusions, there seems to be some relation between the length of a review and its subjectivity. A further conclusion is that negative sentiments are harder to detect than positive ones: negative sentiments are usually expressed using indirect language and irony, or by explaining the whole negative experience as a story, which may or may not contain explicit negative words.
2.4 Twitter
Twitter (www.twitter.com) is an online social networking and microblogging service that enables users to send and read "tweets", text messages limited to 140 characters, via the Twitter website, mobile devices, or instant messaging. Twitter Inc. is based in San Francisco and has offices in New York City, Boston, San Antonio, and Detroit.
Twitter was created in March 2006 by Jack Dorsey, Evan Williams, Biz Stone, and Noah Glass. The site was launched in July 2006. The service rapidly gained worldwide popularity, reaching 500 million registered users in 2012 who posted 400 million tweets per day [24]. The service also handled 1.6 billion search queries per day. This high popularity has led Twitter to be used for various purposes, such as political campaigns, learning media, and advertisement, while it also faces various issues and controversies regarding security, user privacy, lawsuits, and censorship [17].

2.4.2 Tweets
Tweets are text messages sent by users, limited to 140 characters. Users may subscribe to other users' tweets; this is known as "following", and the subscribers are known as "followers". Users can group posts together by topic or type by using hashtags, words or phrases prefixed with a "#" sign. Similarly, the "@" sign followed by a username is used for mentioning or replying to other users. To repost a message from another user and share it with one's own followers, the retweet function is symbolized by "RT" before the message.
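These conventions are simple enough to recover with regular expressions. A minimal sketch in base R follows; the sample tweet is an illustrative assumption.

tweet <- "RT @alice: Loving the new release! #rstats #opensource"

hashtags   <- regmatches(tweet, gregexpr("#[[:alnum:]_]+", tweet))[[1]]
mentions   <- regmatches(tweet, gregexpr("@[[:alnum:]_]+", tweet))[[1]]
is_retweet <- grepl("^RT\\b", tweet)   # tweets reposted with the RT marker

print(hashtags)    # "#rstats" "#opensource"
print(mentions)    # "@alice"
print(is_retweet)  # TRUE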
A word, phrase, or topic that is tagged at a greater rate than other tags
is said to be a trending topic. Trending topics become popular either
through a
concerted effort by users, or because of an event that prompts people to
talk about
one specific topic. These topics help Twitter and the users to understand
what is
happening in the world.

2.5 R
R is a free programming language and software environment for statistical computing and graphics, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis. Polls and surveys of data miners show that R's popularity has increased substantially in recent years. R is also considered one of the most powerful open-source tools for sentiment analysis [12], thanks to the tm package, which provides a comprehensive text mining framework for R.
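A minimal sketch of the tm workflow, from raw text to a document-term matrix; the two example sentences are illustrative assumptions.

library(tm)

docs <- VCorpus(VectorSource(c("R makes text mining easy",
                               "Sentiment analysis with the tm package")))
docs <- tm_map(docs, content_transformer(tolower))       # normalize case
docs <- tm_map(docs, removePunctuation)                  # strip punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stop words

dtm <- DocumentTermMatrix(docs)
inspect(dtm)   # documents as rows, terms as columns, counts as entries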
R is an implementation of the S programming language combined with
lexical
scoping semantics inspired by Scheme. S was created by John Chambers
while at
Bell Labs. R was created by Ross Ihaka and Robert Gentleman [11] at the
University
of Auckland, New Zealand, and is currently developed by the R
Development Core
Team, of which Chambers is a member. R is named partly after the first
names of the first two R authors and partly as a play on the name of S
[13]. R is a GNU
project. The source code for the R software environment is written
primarily in C,
Fortran, and R. R is freely available under the GNU General Public License,
and
pre-compiled binary versions are provided for various operating systems.
R uses a
command line interface; however, several graphical user interfaces are
available for
use with R.
2.5.1 Versions of R
The versions of R, from oldest to newest, are listed below:
a. Version 0.16, the last alpha version developed primarily by Ihaka and Gentleman. The mailing lists commenced on April 1, 1997;
b. Version 0.49 (April 23, 1997), the oldest available source release, which compiles on a limited number of Unix-like platforms. CRAN (the Comprehensive R Archive Network) started on this date, with 3 mirrors that initially hosted 12 packages. Alpha versions of R for Microsoft Windows and Mac OS were made available shortly after this version;
c. Version 0.60 (December 5, 1997), R became an official part of the GNU Project. The code is hosted and maintained on CVS;
d. Version 1.0.0 (February 29, 2000), considered by its developers stable enough for production use;
e. Version 1.4.0, S4 methods were introduced, and the first version for Mac OS X was made available soon after;
f. Version 2.0.0 (October 4, 2004), introduced lazy loading, which enables fast loading of data with minimal expense of system memory;
g. Version 2.1.0, support for UTF-8 encoding, and the beginnings of internationalization and localization for different languages;
h. Version 2.11.0 (April 22, 2010), support for 64-bit Windows systems;
i. Version 2.13.0 (April 14, 2011), added a new compiler function that allows speeding up functions by converting them to byte-code (see the sketch after this list);
j. Version 2.14.0 (October 31, 2011), added mandatory namespaces for packages and a new parallel package;
k. Version 2.15.0 (March 30, 2012), new load-balancing functions and improved serialization speed for long vectors;
l. Version 3.0.0 (April 3, 2013), support for numeric index values of 2^31 and larger on 64-bit systems.
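As a minimal sketch of the byte-code compiler mentioned in item i, cmpfun() from the bundled compiler package compiles an ordinary R function. The toy function is an illustrative assumption, and the observed speedup varies by R version, since recent versions byte-compile functions automatically.

library(compiler)

slow_sum <- function(n) { s <- 0; for (i in 1:n) s <- s + i; s }
fast_sum <- cmpfun(slow_sum)   # byte-compiled version, identical results

fast_sum(1000)               # 500500
system.time(slow_sum(1e7))   # compare timings; the gap depends on R version
system.time(fast_sum(1e7))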
2.5.2 R Graphical User Interface
Many statisticians use R from the command line. However, the command line can be quite daunting to a beginner. Fortunately, there are many different graphical user interfaces available for R which help to flatten the learning curve:
a. RGUI, comes with the pre-compiled version of R for Microsoft Windows;
b. Tinn-R, an open source, highly capable integrated development
environment featuring
syntax highlighting similar to that of MATLAB. Only available for Windows;
c. Java GUI for R (JGR), a cross-platform stand-alone R terminal and editor based on Java;
d. Deducer, GUI for menu driven data analysis (similar to
SPSS/JMP/Minitab);
e. Rattle GUI, cross-platform GUI based on RGtk2 and specifically designed
for
data mining;
f. R Commander, cross-platform menu-driven GUI based on tcltk (several
plug-ins
to Rcmdr are also available);
g. RExcel, using R and Rcmdr from within Microsoft Excel;
h. RapidMiner;
i. RKWard, extensible GUI and IDE for R;
j. RStudio, cross-platform open source IDE (which can also be run on a remote Linux server);

k. Weka, allows for the use of the data mining capabilities in Weka and
statistical
analysis in R.
2.5.3 R Add-on Packages
The capabilities of R are extended through user-created packages, which add specialized statistical techniques, graphical devices, import and export capabilities, reporting tools, and more. These packages are developed primarily in R, and sometimes in Java, C, and Fortran. A core set of packages is included with the installation of R, with 5300 additional packages (as of April 2012) available at the Comprehensive R Archive Network (CRAN), Bioconductor, and other repositories.
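A minimal sketch of the standard workflow for using an add-on package, here the tm package mentioned earlier; any CRAN package name works the same way.

install.packages("tm")  # download and install the package from CRAN (once)
library(tm)             # attach the package in the current session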
2.5.3.1 Add-on Packages in R
The R distribution comes with the following packages:
a. base, base R functions (and datasets before R 2.0.0);
b. compiler, R byte code compiler (added in R 2.13.0);
c. datasets, base R datasets (added in R 2.0.0);
d. grDevices, graphics devices for base and grid graphics (added in R
2.0.0);
e. graphics, R functions for base graphics;
f. grid, a rewrite of the graphics layout capabilities, plus some support for
interaction;
g. methods, formally defined methods and classes for R objects;
h. parallel, support for parallel computation, including by forking and by sockets, and random-number generation (added in R 2.14.0; see the sketch after this list);
i. splines, regression spline functions and classes;
j. stats, R statistical functions;
k. stats4, statistical functions using S4 classes;
l. tcltk, interface and language bindings to Tcl/Tk GUI elements;
m. tools, tools for package development and administration;
n. utils, R utility functions.
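As a minimal sketch of the parallel package from item h, the snippet below runs a function over a list on two worker processes via sockets; the squaring task is an illustrative assumption.

library(parallel)

cl  <- makeCluster(2)                        # start two socket workers
res <- parLapply(cl, 1:4, function(x) x^2)   # apply the function in parallel
stopCluster(cl)                              # always shut the workers down
unlist(res)                                  # 1 4 9 16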
These base packages were substantially reorganized in R 1.9.0. The
former
base was split into the four packages: base, graphics, stats, and utils.
Packages
ctest, eda, modreg, mva, nls, stepfun and ts were merged into stats, and
package mle moved to stats4.
2.5.3.2 Add-on Packages from CRAN
The Comprehensive R Archive Network (CRAN) is a collection of sites which carry identical material, consisting of the R distributions, the contributed extensions, documentation for R, and binaries. The CRAN src/contrib area contains a wealth of add-on packages, including the following recommended packages which are to be included in all binary distributions of R:
a. KernSmooth, functions for kernel smoothing and density estimation
corresponding
to Wand and Jones [21];
b. MASS, functions and datasets from the main package of Venables and
Ripley
[19], for R versions prior to 2.10.0;
c. Matrix, a Matrix package, recommended for R 2.9.0 or later;
d. boot, functions and datasets for bootstrapping from Davison and
Hinkley [8];
e. class, functions for classification (k-nearest neighbor and LVQ), for R
versions
prior to 2.10.0;
f. cluster, functions for cluster analysis;
g. codetools, code analysis tools, recommended for R 2.5.0 or later;
h. foreign, functions for reading and writing data stored by statistical
software like
Minitab, S, SAS, SPSS, Stata, Systat, etc;
i. lattice, for lattice graphics;
j. mgcv, routines for GAMs and other generalized ridge regression
problems with
multiple smoothing parameter selection by GCV or UBRE;
k. nlme, fit and compare Gaussian linear and nonlinear mixed-effects
models;
l. nnet, software for single hidden layer perceptrons (feed-forward neural
networks),
and for multinomial log-linear models, for R versions prior to 2.10.0;
m. rpart, recursive partitioning and regression trees;
n. spatial, functions for kriging and point pattern analysis from Venables
and Ripley
[19], for R versions prior to 2.10.0;
o. survival, functions for survival analysis, including penalized likelihood.
2.5.3.3 Add-on Packages from Bioconductor
Bioconductor is an open source and open development software project
for
the analysis and comprehension of genomic data. Most Bioconductor
components
are distributed as R add-on packages. Initially most of the Bioconductor
software
packages focused primarily on DNA microarray data analysis. As the project has matured, the functional scope of the software packages broadened to include the analysis of all types of genomic data, such as SAGE, sequence, or SNP data. In addition, there are metadata (annotation, CDF, and probe) and experiment data packages.
The packages from Bioconductor are available at http://www.bioconductor.org/.
2.5.3.4 Add-on Packages from Omegahat
The Omega Project for Statistical Computing provides a variety of open-source software for statistical applications, with special emphasis on web-based software, Java, the Java virtual machine, and distributed computing. R packages from the Omega project are available at http://www.omegahat.org/.
2.5.4 R Studio
RStudio is a free and open source integrated development environment
(IDE)
for R. It is available in two editions: RStudio Desktop, where the program
is run
locally as a regular desktop application; and RStudio Server, which allows
accessing
RStudio using a web browser while it is running on a remote Linux server.
Prepackaged
distributions of RStudio Desktop are available for Microsoft Windows, Mac
OS X, and Linux. RStudio is written in the C++ programming language
and uses
the Qt framework for its graphical user interface. The user interface of RStudio can be seen in Figure 2.1.

Figure 2.1: RStudio
The RStudio team contributes code to many R packages and projects.
Here
are a few of the prominent ones:
a. ggplot2, a plotting system for R based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts (see the sketch after this list);
b. knitr, designed to be a transparent engine for dynamic report generation with R, to solve some long-standing problems in Sweave, and to combine features of other add-on packages into one package;
c. plyr, a set of tools for a common set of problems: to split up a big data structure into homogeneous pieces, apply a function to each piece, and then combine all the results back together;
d. RPubs, a free publishing service for R Markdown, which weaves together the writing and the output of code;
e. devtools, a developer tool for building R packages that removes the pains and bottlenecks of package development;
f. packrat, a dependency management tool that makes projects more isolated, portable, and reproducible.
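A minimal sketch of ggplot2's grammar-of-graphics style from item a, plotting the built-in mtcars dataset; the choice of variables is an illustrative assumption.

library(ggplot2)

# Map data columns to aesthetics, then add a point layer and axis labels
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")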
