BUSINESS INTELLIGENCE ON 2.0 ERA: A
FRAMEWORK DESCRIBED WITH EMPHASIS ON
OPINION MINING
by
Gerasimos Galatis,
ggal@ait.edu.gr
Supervisor
Dr. Sofia Tsekeridou,
sots@ait.edu.gr
ATHENS
2011
Declaration
I, Gerasimos Galatis, declare that the work presented in this thesis is original and no
part of it (including the document, the implementation code, etc.) has been copied
from other sources. Work related to this one is cited appropriately.
Gerasimos Galatis
30/03/2011
The work contained in this thesis “Business Intelligence on 2.0 Era: A Framework
Described with Emphasis on Opinion Mining” by Gerasimos Galatis has been
carried out under my supervision.
30/03/2011
Keywords:
Business Intelligence, Opinion Mining, Business Process Management Systems,
Business Intelligence 2.0, Real-Time Analytics
Acknowledgements
I would like to thank my supervisor, Dr. Sofia Tsekeridou, for her sincere support.
EXECUTIVE SUMMARY
Decision making is a central task in managing a business at every level, from
operational to strategic. The raw material for this process is information. Since the
1960s, IT has contributed to transforming data into information and delivering it
through the technologies and techniques gathered under the umbrella term of
Decision Support Systems, later called Business Intelligence. The BI field is
constantly evolving, integrating further with business operations and adapting to the
changes that take place in its environment.
The term Business Intelligence has been used broadly since the early 1990s, after it
was popularized by Howard Dresner. Various definitions exist, but in general it can
be defined as "the process of turning data into information and then into
knowledge" (60). Business Intelligence is based on the existence of a Data Warehouse
into which data from all sources are integrated and from which all views of the
information are drawn. This architecture ensures a single view of the truth; for this
type of architecture, the abbreviation BI-DW is used. The BI framework consists of
two main parts: getting data in, through which data from various sources are
integrated into the Data Warehouse using ETL processes, and getting data out,
which transforms data from the Data Warehouse into meaningful information. For the
latter, a number of methods are used, but in any case the goal is to create an
environment in which uncertainty is minimized and even inexperienced managers are
able to make the right decisions, by effectively transforming raw data into decisions.
Implementing a Business Intelligence application successfully is not easy. The main
reasons include:
• the time lag that exists between the moment BI starts affecting the decision
process and the moment this becomes visible in the company's results.
Thus, managerial support and a culture oriented towards knowledge and information
are required.
The structure of a traditional BI-DW is composed of four main tiers:
• Exploitation of new data sources created in the Web 2.0 environment. In this
context, every person connected to the Internet has become an editor,
producing content about every object and aspect of our lives. Successfully
recognizing and integrating all these, mainly unstructured, sources of data
would affect the decision-making process. For this reason, a new field under
the term Opinion Mining has been developed and is strongly active.
Opinion Mining combines Natural Language Processing and Text Analytics in order
to address the problem of extracting qualitative attributes from a text. The goal of
research in this area is the development of techniques and tools able to process large
quantities of opinionated text in order to answer questions like:
• What are the attributes of the opinion? (negative, positive, neutral, strong,
weak)
Opinion Mining can be performed at Document Level, categorizing the whole
document as positive, negative or neutral, at Sentence Level, or at Feature Level,
where each of an object's features is graded on this scale. The basic steps of an
Opinion Mining analysis are Corpus Collection, Corpus Pre-processing,
Development of Opinion Words and Feature Words, Subjectivity Classification,
Identification of the Opinion Holder, Sentiment Classification and Visualization of
the Output.
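The document-level variant of these steps can be sketched with a minimal lexicon-based classifier. The opinion-word lists, the tokenizer and the scoring rule below are illustrative assumptions, not part of any system described in this thesis.

```python
# Document-level opinion mining sketch. The lexicons below are invented
# examples of "opinion words"; a real system would develop them from data.
import re

POSITIVE = {"good", "great", "excellent", "clean", "friendly"}
NEGATIVE = {"bad", "poor", "dirty", "rude", "noisy"}

def preprocess(document: str) -> list[str]:
    """Corpus pre-processing stand-in: lower-case and tokenize."""
    return re.findall(r"[a-z']+", document.lower())

def classify(document: str) -> str:
    """Sentiment classification: grade the whole document on one scale."""
    tokens = preprocess(document)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Sentence-level and feature-level analysis would apply the same scoring to smaller spans of text; issues like thwarted expectations or context-dependent words are exactly what such naive counting cannot handle.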
A number of issues arise during opinion analysis which make it difficult to achieve
100% success. Such issues include, for instance, the coreference problem, multi-word
expressions, thwarted expectations, ambiguous words whose meaning depends on the
context, and the text's domain and dialect.
The use of Semantic Theory in the Opinion Mining field allows the analysis to
identify more types of connections and links between objects, as well as to resolve
issues caused by the few parameters that traditional analysis recognizes.
TABLE OF CONTENTS
SOCIAL-BASED LEARNING
LIST OF FIGURES
CHAPTER 2
Figure 2.1: From Database Management to Real-Time Business Intelligence and BI 2.0
Figure 2.2: The role of BI Systems in Decision Making (Olszak and Ziemba, 2007)
Figure 2.3: BI Performance Management
Figure 2.4: Basic BI-DW System Structure
Figure 2.5: Evolution of BI to a strategic tool, Yahoo Data Strategy Team
Figure 2.6: Gained value from reducing action time
Figure 2.7: How the Web transformed into its 2.0 state
Figure 2.8: How Web 2.0 creates, uses and affects data handling
CHAPTER 3
Figure 3.1: Feature Words – Objects Relation
Figure 3.2: Indicative Steps in the Opinion Mining Process
Figure 3.3: Index of Basic Methods for Sentiment Analysis
Figure 3.4: Evolution of Opinion Mining from "bag-of-features" to Semantic
CHAPTER 4
Figure 4.1: BPMS based on Ontologies, with Opinion Mining Sub-Module
Figure 4.2: Opinion Mining Analysis Process
Figure 4.3: Opinion Mining enhancements on Basic Hotel SAP® Business One Form
Figure 4.4: Sample of report produced with BusinessObjects® software
Figure 4.5: Current State and future development of Sentiment Analysis Module
Figure 4.6: Current State and next steps of Sentiment Analysis Module, Functions
View
1.1 INTRODUCTION
Decision: the process of choosing a path against all alternatives, taking into
consideration all affecting parameters, in order to produce a desired result.
Information: the main element for a person who has to make a decision, describing a
part or the whole of a parameter. These two notions are strongly correlated in the
field of decision-making. A person who needs to make a decision should be aware of
all the environmental parameters that affect it. The same holds in business: a manager
making a decision should comprehend the environment (internal and external) in
which his company operates. Even though this does not guarantee that the desired
result will be accomplished, it is still the main prerequisite.
Business Intelligence is the evolution of the Decision Support Systems that have been
developed since the middle of the twentieth century. This umbrella term includes all
the hardware, software and methods used to support decisions at every level of the
company. As the environment in which a business operates is extremely complicated
(even more so today, in a much more globalized economy), its analysis is very
difficult. The quantity of data is too big for the human brain to analyze. The first task
of DSS systems was to do what a computer does well: quick computations on raw
data according to predefined algorithms, presenting the results in a human-
comprehensible way, as raw material for the decision maker.
This is still a main task of BI software, but not the only one. Today, BI also takes a
more important role in the next steps of the cognitive process of decision-making,
going further than just summarizing raw data. It also integrates more with the core of
the business, its processes, in order to affect them effectively and directly. Instead of
being just an informer, it is an active performer in a living business environment.
Moreover, it introduces new ways for the business to handle the information overload
that comes with the Web 2.0 environment; information so valuable, yet difficult to
handle. The vision of such a system is to fully support the decision-making process,
helping and protecting managers when leading their operations, and taking an active
role in the business's value chain rather than the role of a simple informant.
In this document, the current trends and situation in the Business Intelligence field are
thoroughly investigated. Attention is given to Opinion Mining, a group of techniques
to handle unstructured data from the Web, whose volume and importance for
businesses keep growing.
Chapter 2 discusses the major notions around the term Business Intelligence, as well
as the trends that shape the latest evolutions that have taken place. The objective is to
create a clear picture of what Business Intelligence might look like a few years from
now.
CHAPTER 2: BUSINESS
INTELLIGENCE ON 2.0 ERA
2.1 CHAPTER INTRODUCTION
Under the umbrella term of Business Intelligence are gathered all the techniques,
processes and methods used to enhance decision-making in the corporate
environment. The sections below discuss the term, the current situation, and the
evolution path towards a broader, more collaborative and responsive BI.
2.2 BUSINESS INTELLIGENCE: THE TERM
2.2.1 What does it mean
The phrase Business Intelligence was first heard in a work of Hans Peter Luhn
(1958), who defined Business Intelligence as "..the process that provides means for
selective dissemination to each of its action points in accordance with their current
requirements or desires." (86). However, it was established as a common term,
regularly used in today's landscape, by Howard Dresner in the early 90s.
- Getting data in, otherwise referred to as Data Warehousing, "means moving data
from a set of source systems into an integrated data warehouse". The flows
come from different sources, internal or external to the company, structured or
unstructured, and in various forms ("heterogeneous platforms"). Data should
be integrated and transformed into a common shape in order for further
analysis to be able to take place. The intermediate processes at this stage are
included in the group of ETL processes (Extract, Transform, Load). To lower
the load on the Data Warehouse, the ETL process often takes place in an ODS
(Operational Data Store), in which data transformation happens before
loading into the Data Warehouse. Moreover, many implementations use Data
Marts, which are smaller data repositories that serve specific users (of a
specific department, geographic area or application). All data marts should use
the same Data Warehouse in order to ensure a "single version of truth".
Getting data in is the most challenging aspect of BI, "requiring about 80% of
time and effort and generating more than 50% of unexpected project costs."
- Getting data out, which is the process that really adds value for the business.
This function consists of transforming data from the data warehouse / marts
into meaningful information. It can take place in the form of enterprise
reporting, OLAP, querying etc.
• Strategic Level. The decisions are mid- and long-term. Data should be up to
date, but not real-time. Accuracy and comprehensive visualization of a
complex environment are important.
2.2.2 Historical Evolution
Business information delivery has really evolved since the middle of the twentieth
century. Even in the relatively new Business Intelligence epoch (mid 90s), the
landscape is constantly changing. As (85) puts it regarding the Business Intelligence
status:
"The emergence of the data warehouse as a repository, advances in data
cleansing, increased capabilities of hardware and software, and the emergence of the
web architecture all combine to create a richer business intelligence environment
than was available previously."
Business Intelligence systems have their roots in Decision Support Systems, whose
investigation started in the late 60s. The research effort of that period was to "study
the use of computerized quantitative models to assist in decision making and ...".
Those systems supported businesses until the late 80s, when the Data Warehouse
notion was introduced. The problem that the Data Warehouse tried, and managed, to
tackle was the integration of multiple data sources under a common umbrella. Until
then, the approach to handling data was application-centric: every application the
company operated had its own database with its own data, even for the same event.
By moving to a common data source for all DSS applications, the systems evolved to
a data-centric approach. What they achieved was a single version of truth.
The next steps of evolution for Business Intelligence tools are towards two main
directions:
2.2.3 Value and Challenges
Every business operates in an environment with many variables and dimensions. A
managerial decision, at whatever level it is taken, strategic or operational, is a
response to a stimulus from the environment, internal or external. In any case, it is
obvious that being properly informed is of the greatest importance for any decision-
maker. The main objective of the Business Intelligence field is to process information
from all sources, integrate it, and visualize it in a comprehensive and user-friendly
way that helps the decision-maker make the right decision. The output of BI tools can
be:
• pattern discovery
• cause-effect cases
• statistical analysis
• what-if scenarios
• mind-maps
and generally notions required for a person to make a decision, but demanding a lot
of time, effort and insight if produced by a human. In its most evolved stage,
Business Intelligence should create an environment in which uncertainty is
minimized and even inexperienced managers are able to make the right
decisions, by effectively transforming raw data into decisions, following the chain
pictured below.
Figure 2.2: The role of BI Systems in Decision Making, (Olszak and Ziemba, 2007)
According to Gartner (2002), the main purposes BI serves at the strategic level are:
(79) notes the great importance of knowledge for operating effectively, and
recognizes four types of knowledge:
• the time lag that exists between the moment BI starts affecting the decision
process and the moment this becomes visible in the company's results
• Effective BI governance
• Managerial support
• Tangible results
From both lists, it is obvious how important managerial support, alignment with the
company's culture, and users' acceptance are.
As described in (72): "..Decision makers and organisations should predominantly
associate BI with organisational implementation of specific philosophy and
methodology that would refer to working with information and knowledge, open
communication, knowledge sharing along with the holistic and analytic approach to
business processes in organisations.."
In (72), the high importance of user involvement throughout the whole lifecycle of a
BI implementation is noted. This seems essential for most application developments
on the premises of a company, as the final output will serve, directly or indirectly, the
users. Gartner (54) notes that BI applications should become part of users' workflow.
Otherwise, even if an application has met its requirements, users will not adopt it, as
they will still have to cope with their main work duties.
Users' involvement should start from the first step of requirements analysis and
should include:
On the path to proving its effectiveness and improving its performance, objective BI
measurement methods have been proposed. Measuring BI is not an easy task, but it is
valuable. "It may be suggested that BI has no value itself. What bears the business
value is the decision that finally took place. The time lag between the ..."
Herring (1996) identifies four indicators for measuring BI success (time savings, cost
savings, cost avoidance and revenue enhancements) without, however, explicitly
creating any measurement means. Sawka (2000) proposes applying measurements
depending on the contribution of BI to specific decisions. Davison (2001) proposed a
solution based on the perception of the users, who are asked specific questions about
the success of a BI project. However, this too can be monetized.
Having covered the basic notions that comprise Business Intelligence, its value to the
business and the challenges it faces, we proceed with the main elements of a BI
implementation.
2.3 STRUCTURE OF A BI-DW
A Business Intelligence system based on a Data Warehouse structure is the common
architecture today; in short, this is called BI-DW. The structure of such a system is
pictured below.
Figure 2.4: Basic BI-DW System Structure. (Data Acquisition Tier: EAI, ETL
process; Data Warehousing Tier: ODS, metadata repository, Data Warehouse; Data
Analysis Tier: statistical analysis, data mining, OLAP, predictive analysis, trend
analysis, data cubes; Data Distribution Tier: alerting, dashboards.)
In the figure above, the structure of a basic BI-DW is pictured inside the blue box.
Outside that box there are extra layers / evolutions that are shaping the progress path
of the BI field today, namely the semantic layer, the BI 2.0 evolution and the BPI
evolution. Those will be discussed in the next chapter.
A BI-DW can be divided into four basic tiers, followed successively from the
acquisition of the data to their presentation to the user:
• Data Acquisition Tier, where all required data sources are spotted and
crawled regularly in order to acquire updated information. This can be a pull
process, where the BI mechanism is searching / crawling for information, or a
push process, where data are fed into the system whenever information is
available.
• Data Warehousing Tier, where data from different sources are integrated and
stored in the data warehouse. A Business Data Warehouse is a unified view of
the enterprise, primarily for integrated reporting (Devlin & Murphy, 1988).
The existence of a DW assures a single view of truth for all applications on
the premises of the company.
• Data Analysis Tier, where data analysis takes place in order to answer the
questions put to the system. The types of analysis vary and depend on the type
of question. The data analysis process can be called from any application that
operates on company premises; it does not have to reside in a central
mechanism. Every application can have its own BI tools; the data, however,
remain the same.
• Data Distribution Tier, where data are presented to the user. This can be
view-only, with the user receiving data either on request or at specific time
periods, or interactive, where the user has control over the data shown to him
in order to check alternative scenarios or give feedback to the system, e.g.
mind-maps.
2.3.1 Data Acquisition Tier
At this stage, data are collected from different sources. These can be external to the
company or internal (feedback from business processes). In any case, all sources that
contain candidate data should be spotted in order to be monitored.
Recently, the term EAI (Enterprise Application Integration) was introduced. Software
at this stage serves as middleware in which all the separate applications used on the
premises of a company are integrated. Thus, heterogeneous platforms are connected
through a single hub, without having to alter their structures or create links between
every pair of platforms.
A recent challenge is the unstructured sources operating on the premises of Web 2.0.
Those feeds are invaluable and have unique characteristics that make it a prerequisite
for an effective BI to monitor them, at least partially. More details on that are
discussed in the next chapter.
2.3.2 Data Warehousing Tier
As noted in (79), "Utility of data warehouses largely depends on the quality of their
data stored". A basic component at that level are the ETL tools. The acronym stands
for extract-transform-load, and it describes all the processes responsible for those
three tasks. Extraction involves the tasks described in the previous section; it is about
obtaining access to data originating from all candidate sources. Information like the
extraction time, the structure of the source data etc. is also logged.
The data are then transformed, in a series of actions thought to be the most complex
stage of the ETL process. As (79) describes, "the process is usually performed by
means of traditional programming languages, script languages or the SQL language.
Data transformation means data unification, calculation of necessary aggregates,
identification of missing data or duplication of data".
Data loading is the last stage, in which the data warehouse is updated with the data
processed in the previous stages. What is important is the speed of performing that
task. Again, as (79) notices: "since the process in question frequently involves
switching the system into the offline mode, it is particularly important to minimise the
time that is necessary to transfer data".
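The extract-transform-load sequence described above can be illustrated with a small sketch; the two source systems, their field names and the in-memory "warehouse" are hypothetical stand-ins for real platforms.

```python
# ETL sketch: extract rows from two hypothetical source systems, unify
# their field names, drop duplicates, flag missing values, compute an
# aggregate, and load everything into an in-memory "warehouse".

def extract() -> list[dict]:
    """Pull rows from two sources that name the same facts differently."""
    crm = [{"cust": "A01", "amount": "120.0"},
           {"cust": "A01", "amount": "120.0"}]        # accidental duplicate
    erp = [{"customer_id": "B02", "total": 80.0},
           {"customer_id": "C03", "total": None}]     # missing value
    unified = [{"customer": r["cust"], "amount": float(r["amount"])} for r in crm]
    unified += [{"customer": r["customer_id"], "amount": r["total"]} for r in erp]
    return unified

def transform(rows: list[dict]) -> list[dict]:
    """Identify duplicates and missing data, as the quoted definition lists."""
    seen, clean = set(), []
    for row in rows:
        key = (row["customer"], row["amount"])
        if key in seen:                                # duplicate: skip
            continue
        seen.add(key)
        if row["amount"] is None:                      # missing: flag and default
            row = {**row, "amount": 0.0, "missing": True}
        clean.append(row)
    return clean

def load(rows: list[dict]) -> dict:
    """Update the warehouse with the processed rows plus one aggregate."""
    return {"facts": rows, "total_sales": sum(r["amount"] for r in rows)}

warehouse = load(transform(extract()))
```

In a real system the transform stage would run in SQL or a scripting language against an ODS, as the quotation above notes, rather than in application memory.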
Depending on the scope of the realised functions, ETL tools may be divided into four
categories (Meyer, 2001):
• EtL tools, which give more attention to the extraction and loading tasks
• eTL or ETl tools, which prefer specific types of input or output data (e.g. they
function exclusively on text files or specific database formats), and offer
reliable and fast functions of data processing and transformation
Moreover, in some implementations Data Marts exist. Those structures are small data
warehouses which serve the needs of specific applications. Again, this is done in
order to balance the load and handle effectively the different processes taking place at
the same time.
In (69), Operational Data Stores and Data Warehouses are thought of as a supporting
middleware layer between the transactional applications and the decision-support
module.
2.3.3 Data Analysis Tier
Data Mining is "the process of identifying and interpreting patterns in data to solve
a specific business problem" (64). Data mining looks for patterns and relationships in
data without knowing the question, i.e. without searching for something specific. As
described in (64), the steps of Data Mining are:
• Data model development. At this stage, data and metadata are mapped to the
business issue that was previously recognized.
• Display results.
2.3.4 Data Distribution Tier
The objective at this stage is to offer a comprehensive view of the data to the user.
This operation is extremely critical, as data that cannot be used effectively add no
business value, whatever the quality of the analysis is. That is why so much analysis
has been done on it.
In (68), four types of visual display formats for BI tools are identified:
• network displays,
• map displays, which provide a view of all items at an upper level. They are
effective at showing a lot of data in a single view.
• Ad hoc queries, as the option for users to create their own queries.
This tier also includes the tools that the system provides to the user in order to
interact with it. This can be either for posing questions or for interacting with the data
that the system returned as results to the initial question.
As for the first case (query posing), (68) recognizes two types of queries:
• Specific Query Formulation, where the user sets specific criteria in order for
the system to show him a specific set of results that match those criteria. This
is the traditional query mechanism.
• Broad Query Formulation, where the user poses a broad query and the system
shows him a broad selection of results to scan.
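The two formulation styles can be contrasted on a toy data mart held in an in-memory SQLite database; the table and its contents are invented for illustration.

```python
# Specific vs. broad query formulation against a toy data mart.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [("EU", "rooms", 120.0), ("EU", "spa", 40.0), ("US", "rooms", 90.0)])

# Specific query formulation: exact criteria, a narrow result set.
specific = con.execute(
    "SELECT amount FROM sales WHERE region = 'EU' AND product = 'rooms'"
).fetchall()

# Broad query formulation: loose criteria, a wide selection to scan.
broad = con.execute("SELECT * FROM sales WHERE region = 'EU'").fetchall()
```

The specific query returns exactly the rows matching both criteria, while the broad one hands back a wider selection for the user to explore.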
Regarding user interaction with the results, a good example is that of mind-map tools,
where the user gives feedback to the system in a number of iterative cycles until a
desired result is reached. A mind-map tool example is further discussed in a later
chapter.
In (65), the ability of a BI system to handle information not only from the data
warehouse but also from totally unstructured sources is discussed (a trend covered
further in a later chapter). A system should respond to users' requests with keywords
or query parameters, showing data that are not structured. In order to tackle that
challenge, a metadata repository is used, in which objects of unstructured data are
tagged with keywords globally used in the DW, which also link them with structured
data.
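A minimal sketch of such a metadata repository, assuming simple keyword tagging; the keyword vocabulary and the documents are invented.

```python
# Metadata repository sketch: unstructured items are tagged with keywords
# that are also used in the data warehouse, linking the two worlds.

DW_KEYWORDS = {"hotel", "booking", "athens"}          # shared with the DW

metadata_repository: dict[str, set[str]] = {}

def tag(doc_id: str, text: str) -> None:
    """Register an unstructured document under every DW keyword it mentions."""
    metadata_repository[doc_id] = {
        w for w in text.lower().split() if w in DW_KEYWORDS
    }

def search(keyword: str) -> list[str]:
    """Answer a keyword request with the matching unstructured documents."""
    return [d for d, tags in metadata_repository.items() if keyword in tags]

tag("review-1", "Great hotel in Athens")
tag("tweet-7", "My booking was cancelled")
```

Because the tags come from the warehouse's own vocabulary, a keyword query can return structured rows and unstructured documents side by side.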
On top of that basic architecture, several ideas have been applied in recent years to
help companies operate more effectively in the new environment being shaped.
Those evolutions are discussed in the next chapter.
2.4 BUSINESS INTELLIGENCE: EVOLUTION NEXT
2.4.1 General Trends
Gartner (59) recognized 2011 as the first year that trends in the Business Intelligence
field were led primarily "..by the need of users for ease of use and flexibility rather
than the need of the IT department for control over data and standards..". According
to this, it describes the strategies of the two major groups of BI software: Traditional
Enterprise BI Platforms and Data Discovery Platforms, which offer convenient
tools for information retrieval. The trend stated above would normally give a window
of opportunity to the second group, smaller vendors offering tools for easy
information discovery, against the big platform vendors. However, big vendors
continue to have the biggest market share due to their promises of tighter process
integration and vertical integration within their information infrastructure stacks. In
any case, what is described above is a turn towards user friendliness and ease of using
the data.
In its 2011 report, Gartner also predicted that "interactive visualization, predictive
analytics, dashboards, and OLAP usage will increase", even though the largest
proportion of BI processes are ad hoc reports. The new interfaces will push the
information from analysts to a larger portion of users. In a few words, BI will become
pervasive, spreading to a larger user base due to the availability of easy-to-
understand dashboards and Web-based platforms accessible from any place with an
Internet connection.
• Consumerization of BI. "BI tools must be simple, mobile and 'fun' in order
to expand use and value."
Regarding the fourth trend (BI embedded in the business process), (60) notes the
change in businesses that also brings a wind of change to the Business Intelligence
field. Companies are more process-driven, linking activities throughout the
company's workflow in order to control results. KPIs are set and watched in order to
achieve performance. Measurements are shared throughout the company, promoting
information democracy.
The figure below (taken from the Yahoo! Data Strategy Plan (56)) pictures the
evolution from the initial state of plain transaction description to a strategic tool. A
big step for tomorrow is BAM (Business Activity Monitoring). This is further
discussed later.
Figure 2.5: the stages range from Transactional Reporting ("give me my reports")
through Tactical Decisions ("what should I do right now") towards strategic use.
integrated tools that supports business and IT users in managing process execution
quality” (84)
The notion of BPI creates a new era in which Business Intelligence really affects the
value chain of a company, interacting directly with the elements that truly create
value for a company: its processes. It is based on the theory of Closed-Loop
Business Intelligence, and it embodies and integrates the notions of Business
Performance Management and Real-Time Data Warehousing. It creates an extra
layer that covers the operation of the whole BI engine, as described in Figure 2.3.
Those three basic notions are described further below.
2.4.2.1 Closed-loop BI
Colin White talks about a structure of Closed-Loop Business Intelligence (53), in
which, while the BI system feeds Operations in their decision-making process, an
opposite flow also exists through which Operations feed BI with data for analysis.
What exists in the middle are collaborative applications that help users make
decisions about Operations, using data from BI analysis.
strategic level are passed quickly back to the system, decreasing the required lapse
time.
2.4.2.2 Business Performance Management
Business Performance Management is a management theory prevalent in recent
years. It is defined as "a set of processes that help organization optimize business
performance by encouraging process effectiveness as well as efficient use of financial,
human and material resources" (60). BPM can be considered a process-optimization
approach. A prerequisite for performing this type of management is to be able to
transfer the strategic goals of the company to the day-to-day, operational level. This
is done through the implementation of KPIs (Key Performance Indicators). Those
KPIs should be fed "at the right time, at the proper decision level and in the best
form" (60).
A BPMS (Business Process Management System) is the package created from the
convergence of BPM with information technology (and, of course, BI packages),
"used to automate processes and provide process monitoring and improvement
capabilities representing a revolutionary way of using technology in the business
environment" (83). What a BPMS tries to do is bridge the gap between executing a
process and measuring its performance. This can be done in an iterative process in
which the system gives the user feedback on the process's performance after each
change he makes.
• PDW Loader, which extracts data from all data sources, checks them and
integrates them into the PD Warehouse
• Process Data Warehouse, where data are stored for further analysis
• Process Mining Engine, which applies data mining techniques to the data in
the PD Warehouse
Those elements correspond to the four tiers of a BI-DW described in the previous
section.
The software that allows the monitoring of business activities is also referred to as
BAM (Business Activity Monitoring). BAM is usually thought of as "the technology
module of BPMS being the real-time reporting, analysis and alerting of significant
business events, accomplished by gathering data, key performance indicators and
business events from multiple applications" (60).
• a KPI manager that computes all the indicators necessary at the different
levels to feed dashboards and reports
• a Rule Engine that monitors KPIs in order to alert users about events
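These two components can be sketched as follows; the KPI definitions, the events and the alert thresholds are hypothetical.

```python
# BAM sketch: a KPI manager computes indicators from raw business events,
# and a rule engine monitors those KPIs in order to alert users.

def kpi_manager(events: list[dict]) -> dict:
    """Compute the indicators that feed dashboards and reports."""
    orders = [e for e in events if e["type"] == "order"]
    errors = [e for e in events if e["type"] == "error"]
    return {
        "order_count": len(orders),
        "error_rate": len(errors) / len(events) if events else 0.0,
    }

# Hypothetical rules: (KPI name, breach predicate, alert message).
RULES = [
    ("error_rate", lambda v: v > 0.25, "error rate above 25%"),
    ("order_count", lambda v: v < 1, "no orders arriving"),
]

def rule_engine(kpis: dict) -> list[str]:
    """Return an alert message for every KPI that breaches its rule."""
    return [msg for name, breached, msg in RULES if breached(kpis[name])]
```

Feeding the KPI manager with a fresh batch of events and passing the result through the rule engine closes one monitoring cycle of the loop described above.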
2.4.2.3 Real-time BI
Real-time Business Intelligence is the answer to the users' need for fast responses,
with fresh data, to ad-hoc queries. It is enabled by Enterprise Information Integration,
Enterprise Application Integration and real-time data warehousing technologies. As
decisions at the strategic level are mainly long-term, this technology refers primarily
to tactical decisions. In (56), some examples of real-time analytics are mentioned:
It is a fact that operations like the ones mentioned above have no value if the data
they process are outdated. Traditional BI systems need time to collect, process and
integrate information before it becomes available for analysis. From the time a query
is posted to the system, there is again a time requirement until it is processed and
available for presentation to the user. This is captured by the latency theory below. (62)
• Data latency, which is the time required from the moment an event occurs
until the moment it is stored in the data warehouse
• Analysis latency, which is the time between its storage and the moment its
analysis is finished and becomes available to users
• Decision latency, which is the time between its availability and the moment
an action takes place
The first two require changes in technical aspects, the third one in business processes.
Improvement in the first two alone does not bring any business value.
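The three latencies add up to the total action time; a trivial worked example (with assumed figures, in minutes) makes the point concrete:

```python
# The total time from event to action is the sum of the three latencies.
# All figures below are assumed, in minutes, purely for illustration.

def action_time(data_latency: float, analysis_latency: float,
                decision_latency: float) -> float:
    """Total time from a business event until an action is taken."""
    return data_latency + analysis_latency + decision_latency

# Nightly batch ETL, slow reporting, manual decision process:
before = action_time(data_latency=60, analysis_latency=30, decision_latency=240)
# Faster ETL and analysis alone; the decision process is unchanged:
after = action_time(data_latency=1, analysis_latency=5, decision_latency=240)
# The decision latency dominates, so the purely technical speed-up
# shrinks the total action time only modestly.
```

This is exactly why improving the first two latencies without touching the business processes brings little value.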
However, as noted in (61), real-time is not usually the requirement; instead, right-time
is what is needed. For example, for credit card fraud, an instantaneous response is not
the objective; a latency of a few seconds is acceptable. For other applications, right-
time means a time window of minutes or even hours.
• Money. Reducing latency is not cost-free; specialized hardware and software are required.
o Data Propagation, with which data are copied from one source to another. This is a push operation: the process does not wait for a request in order to begin. Enterprise Application Integration (EAI) is such a technology.
• Flight dashboards, which helped operations identify issues in their flight network. For example, on-time performance of flights was measured through real-time analysis of arrival and departure times.
In order to balance the load on the hardware, they divided the queries on the database into two groups: tactical ones, which should provide real-time data and are set as high-priority on the query engine, and strategic ones.
Another view of a real-time system is the ability to transfer users' actions quickly into the system. (66) gives weight to the real-time action of an RTBI package, meaning the option of the user to intervene instantly in Business Operations in order to calibrate them according to BI results. A main component of that scenario is a process dashboard through which the user drives and changes processes. Enterprise Application Integration (EAI) suites offer a solution for integrating heterogeneous applications (different operating systems, different databases, different backbone languages) on a common platform.
• Custom Solutions. They are optimized for specific needs, the initial development cost is low and they can adapt to meet changing business needs. On the other hand, they lack integration with contextual data (amongst others).
hand, many EDWs fail due to organizational reasons (departments lose control of their data and agility). It is not real-time and it is costly (up to $50 million for a large organization).
2.4.3 The Purple Evolution: Business Intelligence 2.0
2.4.3.1 Environment 2.0
The Web has transformed over the last decade in an extraordinary way, as already noted by every Internet commentator. On Web 1.0 there was a clear distinction between two worlds: the Net, which contained information, and the Users, who read it. The editors of that content were the people who ran their web pages.
On Web 2.0, the tap opened: all users became content creators, and data production multiplied. This fact, together with the development of faster networks and the emergence of Social Web apps, amongst others, led more and more people to establish the Internet as their primary means of connecting with the community (communicating, getting informed etc.). Web Environment 2.0 is a "living organism" with billions of nodes interacting with each other. Data is produced and evolves at a great pace, in real time.
Figure 2.7: How Web transformed on its 2.0 state. Source: Dion Hinchcliffe (2006-04-02). "The State of Web 2.0",
Web Services Journal
The new role of those services in society today was highly visible in the great political and social events of recent years (Georgia, Tunisia, Egypt, Libya). Twitter was treated as a primary news source, even by big media. Its real-time engine created news much more quickly than traditional media, even if not as reliably.
Web 3.0, or whatever the next evolution of the Internet will be called, will multiply the nodes of the Network even further and affect the data produced. The Internet of Things, Augmented Reality and the Semantic Web are already extending it in all existing and new dimensions.
On (51), the position of Business Intelligence in the context of the present Web Environment 2.0 is established. The commentator recognizes that, in parallel with the Web 2.0 evolution, BI technologies have changed mainly on the output side: HTML and PDF were used initially, and then whole new platforms like Web portals and mobile phones. However, in a world where Twitter grew 2,500% in 2009 (52), BI has not absorbed the changes. As described on (51), BI 2.0 is the combination of:
• Advanced analytics
• Enterprise Integration
• In-memory analytics
• Open Source BI
Continuing, (51) recognizes the main aspects of Web 2.0 services and why they affect Business Intelligence. Below follows a list of 2.0 services and facts about the way they create, handle or visualize data. Those facts may mark a new epoch for corporate Business Intelligence as well.
• Create New Data: content creation through an environment of constant user involvement
• Create New Links for Existing Objects: any user can link information with relevant issues
• Enrich Data: rating of uploaded data creates a sense of its "social vibe"
• Enrich Data: tagging and commenting on existing content; Collective Intelligence through people bringing up information they consider valuable
• Visualize Data: metrics regarding article popularity are visible
• Create New Links for Existing Objects: nodes (users) create new specific-dimension connections (professional)
Figure 2.8: How Web 2.0 creates, uses and affects data handling
• Visual display of data will evolve and be used broadly to summarize information more effectively than ever
In that environment, a way should be found to take advantage of the new capabilities. On (78), the term Web Business Intelligence was introduced in order to define a BI system that draws data automatically and in a meaningful way from the constantly updated Web. The modules of such a system are:
• Profile Database, which both modules use for drawing data
Highly correlated with the concepts already discussed is the management of the knowledge created and of the processes through which structured and unstructured data are handled in order to present them in a way that enables action to be taken. The categories are:
• Pattern Discovery
• Relevance Ranking, which refers to the problem of finding the most relevant sources among a big number of candidates.
On (80), researchers describe a solution in which key terms, fed by analysis on the DW, are queried through Web crawlers in order to enhance knowledge. Search engines are queried with keywords; Web pages are parsed and indexed. Co-occurrence analysis then takes place in order to identify groups of terms called "Web Communities". The results are visualized in a map environment.
The challenge of discovering and integrating unstructured data into the Data Warehouse needs special handling. (67) proposes visualization methods based on hierarchical and map displays as more effective for accessing and browsing information. The paper uses the term competitive intelligence tools to describe those that aim to systematically collect and analyze information from the competitive environment to assist organizational decision making. The data source for those tools is primarily the Web. That category stands against tools that use massive amounts of data stored in data warehouses in order to extract essential business information from them. Reviewing those tools, they notice that they provide different views of the collected information but no further analysis.
Common methods to access web content also include crawlers, which materialize web sites locally by following links (such as Nutch-Hadoop), and web query languages, such as YQL.
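The core of the crawling approach mentioned above is extracting hyperlinks from a fetched page so they can be enqueued and followed. A minimal sketch using only Python's standard-library `html.parser` (the sample page is invented for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags; a crawler would enqueue
    these URLs and fetch them in turn, materializing the site locally."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/post/1">Post 1</a> <a href="/post/2">Post 2</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/post/1', '/post/2']
```

A production crawler (such as Nutch) adds fetching, URL normalization, politeness delays and deduplication on top of this basic link-extraction step.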
Finally, Web data can also be assessed from different views. For example, the backlinks of a company's site can be used as a sign of the company's social communities.
2.4.4 Semantic Layer
On (82), the NLP (Natural Language Processing) and IE techniques used for unstructured data from the Web are included under the term Web-ETL. Those data then take part in the decision-making process. For example, IE techniques are used in a financial risk assessment module, specifically on company profiles. If a company comes from Russia, whose economic rating from Fitch has fallen, then the system will respond with a rise in the risk for the specific company.
• Give the opportunity for new data sources, not yet analysed, to be integrated into the BI process
We should note here the issue of ontology population, where the same object has different names in different ontology definitions.
Besides using a semantic layer in the Web-ETL stage, it can also be used as an extra layer on the Data Warehouse tier, providing users with new capabilities. In a traditional relational database, the main approaches to data schemas are:
The concepts around analyzing unstructured data from the Web are further discussed in Chapter 3.
2.5 SPECIAL ISSUES
2.5.1.1 Cognitive Map Development: Weak Signs Management through PUZZLE System
Decision-making at the strategic level is made not only from data based on past facts but also from sparse chunks of information that come from the environment of the company. On (63), researchers note that a primary task for a Business Intelligence system is to create sense from those sparse data in order to take advantage of opportunities and avoid threats. Those sparse data are called "weak signs" and are difficult to interpret due to their characteristics (El Sawy, 1985 and Rouibah, 1997). Business Intelligence is a process defined by the five phases of a cycle:
• Organising tracking
• Taking action
On (63), proactive weak signs (instead of reactive ones) are studied. Those signs do not have a direct correlation with any business process but can be the source of a decision for strategic change, making their analysis even harder.
The final output for the user is an environment in which he asks for events concerning a specific actor. He can then create links between events in order to produce cognitive maps – puzzles. Following this process iteratively, he is able to observe the landscape from different angles in order to finally conclude on links between objects.
On (77), the proposed approach aims at the situational awareness of the user as the sum of information perception, comprehension and projection. This process is called Situation Assessment and can be supported by technology. The other notion that affects the cognitive process of a human is mental maps, which are the set of rules and assumptions that a person uses in order to make a decision. They act as a reasoning mechanism and affect Situation Assessment. According to the above, applying cognitive theory to BI has as an objective to help users handle the mass of information that traditional BI systems produce. The system in (77) simulates the human cognitive process by taking as input a human query and retrieving relevant business cases from its database in order to conduct its analysis. The results of the analysis are shown to the user, who evaluates them, in an iterative process which helps him enhance his own cognitive process.
2.5.2 Predictive Analytics
According to (48), "predictive analytics are used to determine the probable future outcome of an event or the likelihood of a situation occurring". Predictive analytics view data from a different perspective than traditional BI: they search for unknown patterns, series of data and, in general, events that fall in categories the user would never have queried for – as he does not know they exist. Methods used include tools like clustering, decision trees, text mining and others.
As (48) says, "the core element of predictive analytics is the predictor, a variable that can be measured for an individual or entity to predict future behaviour".
According to (85), besides providing the right data, BI systems should also provide it "at the right time, at the right location, and in the right form to assist decision makers". They introduce the notion of proactive BI, where the time frame from integrating the data until an action is passed into the system should be minimized. This notion is similar to, and supported by, the real-time data warehousing architecture discussed in a previous section.
2.6 CONCLUSIONS – SITUATIONAL BUSINESS INTELLIGENCE
On (45), the term "Situational Business Intelligence" is introduced. The researchers describe a business environment with "a long tail of situational applications". What they mean is that this environment is shaped not only by the critical applications that were traditionally monitored and analysed in order to offer information to Management (e.g. ERP, CRM), but also by a number of structured and unstructured sources, internal or external to the company. The characteristics of those sources make them difficult to assess but also increasingly important to do so. Moreover, the value of information collected in an environment of Situational Business Intelligence decreases over time.
As noticed in (45), "answering Situational Business Intelligence queries requires a close interaction between components for gathering text data, for extracting structured data from text, for cleansing extracted data, for obtaining a schema from the extracted data and for processing the extracted data on top of the generated schema."
3.1 CURRENT SITUATION
The environment described in the previous chapter produces two basic categories of information: opinions and facts. Both of those categories are mainly unstructured information, expressed in a way not easily machine-recognizable. Facts were the main element of the Web in its initial state (the 90s and early 00s) – mainly a static source of information. Analyzing facts has been the theme of much research focusing on Web Mining techniques.
However, Web 2.0 has turned everyone into an editor. The production of information from users is abundant and constant, and its importance is extremely high as more and more users choose this kind of stream in order to get informed. A blogger's review of a new laptop can carry the same weight for potential customers as that of the editor of PC Magazine. A tweet commenting on a speech of the US President could have more effect than the Reuters Chief Editor's article on the same issue.
The analysis of such information would give Decision Makers valuable, real-time, on-the-spot trends; it is thus the most important raw material for an environment of Situational Business Intelligence as described above. These kinds of techniques could serve a variety of purposes: businesses would be able to know and understand their customers' view of a product without having to conduct a marketing survey; an organization could measure the initial effect that a policy change or a decision had on its stakeholders and rapidly restructure its strategy. In a few words, these are the times when business – and other – entities have the most direct reach to their audience, whatever that is: customers, buyers, voters. Such an opportunity should not be left unexploited.
This side would also be evolutionary: capitalist and perfect-market rules assume a consumer with total knowledge of the market. However, no tool is available for him to be fully informed in a globalized marketplace.
The effort to build an effective tool to address those problems is a difficult one, as the exploitation of such data would normally require a lot of human involvement and man-hours. However, the abundance of such valuable data and the current trends of the Web (a constantly growing active user base with more involvement; Web 3.0, which promises a global net of interconnected objects, the Internet of Things) make such a need much more intense. What we are virtually searching for is effective Web-ETL processes, transforming and integrating unstructured information into the same data warehouse.
3.2 OPINION MINING: THE TERM
3.2.1 Definition
An opinion differs from a fact in that it carries a very important emotional load. Analyzing this emotional dimension is the subject of the Opinion Mining field. Opinion Mining (alternatively, Sentiment Analysis) combines Natural Language Processing and Text Analytics techniques in order to address the problem of extracting qualitative attributes from a text. The results of such a task, on a given text, would answer questions like:
• What are the attributes of the opinion? (negative, positive, neutral, strong, weak)
The goal is the development of techniques and tools able to process large quantities of opinionated texts in order to answer the questions above and visualize the results in a way that is easily comprehensible for the end user. Liu defined Opinion Mining as "..the task that aims to extract attributes and components of the objects that have been commented on in each document d ∈ D and to determine whether the comments are positive, negative or neutral, with D being a set of evaluative text documents that contain opinions (or sentiments) about an object.." (4)
[Figure: An object O with features F1, F2, …, Fn; each feature fi is associated with a set of synonyms S1(fi), S2(fi), …]
Parsing techniques are a basic part of conducting the analysis. As defined in (5), parsing or, more formally, syntactic analysis, is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given, more or less formal, grammar. Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. It assigns a syntactic and logical form to an input sentence.
3.2.2 Early History
Early research was done from the late 70s in specific fields belonging to the family of Opinion Mining (e.g. Machine Learning, AI). From the mid 90s (with the early works of Wiebe, Hatzivassiloglou etc.) and especially from the 00s, Sentiment Analysis has attracted a lot of interest, with much research work done on it. The reasons were the development of the Web from a read-only state to a read-write one with the emergence of Web 2.0 and the abundance of data that it created, as well as the rise of techniques in scientific fields like Natural Language Processing (4).
3.2.3 Opinion Mining Main Categorization
Approaches to Sentiment Analysis vary across research efforts. One common distinction is between Document Level and Sentence Level.
Finally, Sentiment Analysis on the Attribute (or Feature) Level introduces a new stage in which features are identified in the text. Opinion words are then connected with specific features of the object. This is critical in order to create a clearer view of which elements the Opinion Holder liked and which he did not. In a pros-and-cons work, this kind of analysis is required.
3.3 OPINION MINING: THE METHOD
3.3.1 Subtasks
The sentiment analysis task is divided into a number of subtasks. Even though various papers identify the required subtasks slightly differently, a common analysis includes the steps in the diagram below:
1. Corpus Collection
2. Corpus Pre-processing
3. Lexicon Development
4. Subjectivity Classification
5. Identify Opinion Holder
6. Sentiment Classification
7. Visualization – Summarization
The tasks and their order described above are not a strict process. For example, identifying the opinion holder is not so important for blog posts commenting on products (as the holder is usually the one who posted the comment). On the other hand, on specific occasions Topic Identification is conducted or, when using Semantic Technologies, Ontology Development is crucial. Still, those tasks form a common process covering the most important aspects of Opinion Mining methods.
3.3.2 Corpus Collection
The corpus that is further analyzed is collected via a crawler that runs through designated URLs in order to collect pages of interest. As all of the raw material is in HTML, a specific process takes place in order to isolate the parts that may contain opinion content. This task is called HTML Parsing. Such tools take advantage of common HTML syntax, which virtually consists of a collection of strings inside tags. (32) used the tree-like structure of HTML pages, handling each part as a distinct leaf of the tree. They then used a wrapper agent which extracts all the words contained in leaves of interest, using a supervised technique. The specific method refers to extracting data from structured text, not unstructured.
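The idea of keeping only the "leaves of interest" can be sketched with Python's standard `html.parser`: text is retained only while the parser is inside an element marked as opinion content. The `class="review"` marker and the sample page are assumptions for illustration; real wrapper agents learn which leaves matter rather than hard-coding them:

```python
from html.parser import HTMLParser

class ReviewTextExtractor(HTMLParser):
    """Keeps only text found inside elements whose class attribute marks
    them as review content (the "leaves of interest"). A sketch: void
    elements like <br> and attribute variants are not handled."""
    def __init__(self, target_class="review"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # >0 while inside a target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1     # nested tag inside the target element
        elif ("class", self.target_class) in attrs:
            self.depth = 1      # entered a leaf of interest

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

html = ('<div class="header">Buy now!</div>'
        '<div class="review">Great battery, <b>poor</b> screen.</div>')
parser = ReviewTextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # Great battery, poor screen.
```

Note how the navigation text ("Buy now!") is discarded while the review body, including text inside nested tags, survives for the later analysis stages.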
On (15), the issue of comment extraction from blogs is discussed. As the author's view alone is insufficient, the researchers search for a method to extract comments automatically from blog posts' pages. The writers propose a "page-level" approach versus the "site-level" approach. With the latter, there is a human cost in identifying patterns on each site and building new ones for every new site. In order to apply the former, they use a technique that combines a set of predefined rules with a supervised learning technique. Firstly, the HTML page is parsed. After that, the HTML is processed as strings in order to find tags that could be the head of a repetitive pattern. Those tokens are further examined to see if any rules can be formed; those rules divide comment from non-comment content. After that, an SVM is adopted to learn a comment/non-comment classifier.
3.3.3 Corpus Pre-processing
After collecting the text corpus and prior to proceeding with its analysis, various pre-processing tasks may take place. These can be:
• Stemming. Words are reduced to their root; for example, the word "cats" will become "cat". Stemming algorithms can be applied for that morphological analysis. There also exist parsing tools which process a text, like the Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml).
• Synonyms. After stemming, there may be a further step in order to reduce even more the words analyzed in the text. This can be done by grouping synonyms together and mapping them onto their "gloss root", meaning a common word onto which all synonyms are mapped.
• POS Tagging. Hatzivassiloglou and McKeown (2003) noticed that the presence of adjectives in a text is a good indicator of text polarity. In later research on movie reviews, Pang et al. (20) noticed that using only adjectives as polarity indicators performed worse than also using nouns and verbs. In any case, Part-Of-Speech Tagging is performed in order to use intuitions like the above. Again, tools like the Stanford Parser can perform this task.
• On (6), researchers try to align ordinary opinions (which are, as they say, a better and unbiased source of information) with expert opinions (which are structured texts, but of not so much value). Using semi-supervised methods, they align ordinary opinions to the "template" expert opinion structure.
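The first two pre-processing steps above (stemming and synonym grouping) can be illustrated with a deliberately naive sketch. The suffix list is a toy, not the Porter algorithm, and the synonym table is invented; real systems would use a proper stemmer and a resource like WordNet:

```python
# Toy pre-processing pass: naive suffix stripping plus synonym grouping
# onto a "gloss root". Both tables below are illustrative only.
SUFFIXES = ("ing", "ies", "es", "s")
SYNONYMS = {"excellent": "good", "great": "good", "awful": "bad"}

def stem(word):
    """Strip the first matching suffix, keeping a minimal root length.
    NOT the Porter algorithm: 'running' becomes 'runn', not 'run'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(tokens):
    """Lowercase, stem, then collapse synonyms onto their gloss root."""
    stemmed = [stem(t.lower()) for t in tokens]
    return [SYNONYMS.get(t, t) for t in stemmed]

print(normalize(["Cats", "running", "excellent"]))  # ['cat', 'runn', 'good']
```

The point of the sketch is the effect, not the algorithm: distinct surface forms are mapped to a smaller shared vocabulary before lexicon lookup, which is exactly what makes the later classification stages tractable.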
3.3.4 Lexicon Development
Opinion Mining is based on the presence of a lexicon of terms. A lexicon is a collection of words used as identifiers in order to define the attributes of a text. The terms we are interested in are Feature Words – words used for the object or its features – and Opinion Words – words that bear polarity in the specific domain. Opinion Words are further categorized into those that express the same feeling whatever the object is – like the word "excellent" (domain-independent terms) – and words that may have a totally different meaning depending on the theme (domain-dependent terms). For example, "hot" may have a negative meaning in the travel domain (hot weather) but a positive one in the movies domain (hot movie). Even in the same domain and for the same object, the same word may have a different orientation for different features: for example, small size is positive for a camera but small capacity is negative. The lexicon developed at that level defines the prior polarity (Wilson et al.), which means a word's meaning out of its current context.
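One way to represent prior polarity together with domain- and feature-level exceptions is a layered lookup, where more specific context wins. All lexicon entries below are illustrative, chosen to mirror the "hot" and "small" examples above:

```python
# Layered polarity lexicon: feature-level overrides beat domain-level
# overrides, which beat the context-free prior polarity. Entries are
# illustrative only (+1 positive, -1 negative, 0 neutral/unknown).
PRIOR = {"excellent": 1, "bad": -1, "hot": 0, "small": 0}
DOMAIN = {("travel", "hot"): -1, ("movies", "hot"): 1}
FEATURE = {("camera", "size", "small"): 1, ("camera", "capacity", "small"): -1}

def polarity(word, domain=None, obj=None, feature=None):
    """Return the most context-specific polarity known for the word."""
    if (obj, feature, word) in FEATURE:
        return FEATURE[(obj, feature, word)]
    if (domain, word) in DOMAIN:
        return DOMAIN[(domain, word)]
    return PRIOR.get(word, 0)

print(polarity("hot", domain="travel"))                 # -1 (hot weather)
print(polarity("small", obj="camera", feature="size"))  # +1 (compact body)
print(polarity("excellent"))                            # +1 (domain-independent)
```

The design point is that the prior polarity is only a fallback: the same surface word resolves differently once the domain or the targeted feature is known.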
Creating an efficient lexicon is not easy work. The abundance of words in each language, as well as other challenges further discussed later (synonyms, sarcasm etc.), make polarity definition hard even for humans. On (7), a training set was manually collected and annotated by users as opinionated or not, along with its polarity and its feature. Even with manual annotation, annotators disagreed at a high enough rate about the polarity and the feature targeted in a sentence.
The techniques used are divided into supervised and unsupervised. Supervised methods use machine-learning techniques in order to "teach" a classifier to recognize words from a training set. Unsupervised methods, on the other hand, mostly use syntactic and grammatical rules, manually constructed lexicons or even statistical classifiers, but without a learning step. On (31), researchers noticed that supervised methods were more accurate than unsupervised ones but rely heavily on the training set and need time to train; so they considered unsupervised methods more appropriate for real-time tools.
• Conjunction Method,
possible on the same subset. The group with the highest average frequency is labelled as positive.
In a later paper (4), Liu enhanced the conjunction theory, noticing specific sub-cases and introducing relevant parameters in his method:
Pointwise Mutual Information (Turney, 2001). This method is based on the intuition that terms with the same orientation tend to co-occur in the same document. It uses the words "excellent" and "bad" as anchors of positive and negative polarity respectively. In order to decide the polarity of an opinion word, the AltaVista search engine was queried to find co-occurrences of the OW with the word "bad" and then with the word "excellent" in Internet documents. Depending on the "distance" from its two anchor words, the OW was classified as positive or negative. Turney and Litman showed in their experiments that the conjunction method makes more efficient use of corpora than the PMI method, but the advantage of PMI is that it can easily be scaled up to very large corpora, where it can achieve significantly higher accuracy, as noticed in (23).
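The PMI-based orientation score can be sketched as a log-ratio of co-occurrence hit counts against the two anchor words. The hit counts below are hypothetical stand-ins for what a search engine would return, and the formula is the commonly cited SO-PMI style log-odds, using the anchors "excellent" and "bad" as in the description above:

```python
import math

# Hypothetical hit counts a search engine might return:
# pairs are "word NEAR anchor" co-occurrences, singletons are plain hits.
HITS = {
    ("amazing", "excellent"): 1200, ("amazing", "bad"): 150,
    ("terrible", "excellent"): 90,  ("terrible", "bad"): 1100,
    ("excellent",): 500000, ("bad",): 480000,
}

def semantic_orientation(word):
    """SO-PMI-style score: positive means the word co-occurs more with
    "excellent" than chance predicts, negative means closer to "bad"."""
    return math.log2(
        (HITS[(word, "excellent")] * HITS[("bad",)])
        / (HITS[(word, "bad")] * HITS[("excellent",)])
    )

print(semantic_orientation("amazing") > 0)   # True  -> classified positive
print(semantic_orientation("terrible") < 0)  # True  -> classified negative
```

The "distance" mentioned in the text corresponds to this score: its sign gives the polarity class and its magnitude gives the strength of the association.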
The same intuition can be used for feature extraction. On (33), researchers use PMI scores of candidate feature words for the restaurant domain against the word "restaurant".
Other approaches or variations of the methods above have also been proposed. On (8), and regarding opinion words, researchers tried to tackle the domain-dependency issue. They did that by initially finding the positive/negative words with the highest frequency. Then they used WordNet to enrich the lexicon with synonyms, and finally added words with high frequency that were not in the generated list as domain-specific words.
On (9), researchers used polarity anchors and the Normalized Google Distance, slightly changing the traditional PMI method.
Manual lexicon development is also possible. On (7), researchers considered Tourist Attraction Names as important Feature Words for the Travel Domain. As a recognition tool for that case did not exist, they manually prepared a list of tourist attractions. They also noticed that using traditional ways to locate domain-specific words is not suitable for the travel domain, due to the fact that a tourist object can be anything. Therefore, they
3.3.5 Subjectivity Classification
Some methods handle Subjectivity Detection as a separate step from Sentiment Classification. Subjectivity Classification is defined in terms of opinion phrases: "Opinion phrases are adjective, noun, verb or adverb phrases representing customer opinions. Opinions can be positive or negative and vary in strength" (18).
This step can be applied in areas where subjective messages need to be identified (e.g. flame recognition in e-mail communication) or in the context of a full sentiment analysis as a sub-step before proceeding with polarity identification. It has been shown that applying such a task can increase the success rates of a sentiment analysis since, if objective phrases are taken into consideration, polarity classifiers can be falsely affected.
Various methods have been proposed to find the subjectivity of a text. A sentence-level Naïve Bayes classifier was used by Wiebe, based on the presence of specific syntactic schemas (1999). Initial work found a strong correlation between the existence of adjectives in a phrase and its subjectivity (Bruce & Wiebe, 2000). At the document level, the subjectivity of a text was determined by the number of times that specific lexical features (e.g. the word "good") were found in it. However, that alone does not ensure subjectivity, as a different context can change the emotional load of a word. On (17), a potential subjective element is differentiated from a subjective element, which is an instance of it that is indeed subjective in the specific context. The researchers managed to increase success rates over the baseline adjective feature by using a similarity-identification approach. Initially, they manually created a corpus of subjective words and identified subjective sentences by spotting words of that list in them. They then refined the results further by creating pairs of words, using the intuition that words that appear in the same relationships with the same words tend to be similar.
On (34), researchers used a pre-annotated collection of articles and a Naïve Bayes classifier in order to determine subjectivity at the document level. Success rates were extremely high (97%). In order to complete the same task at the sentence level, they used three different methods: a similarity approach, in which the subjectivity of a phrase is judged according to its level of similarity with other phrases already tagged as subjective; a single Naïve Bayes classifier; and multiple Naïve Bayes classifiers, each based on a different subset of features (success rates were up to 91%).
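The Naïve Bayes approach mentioned above can be sketched end to end on a tiny hand-made corpus. The training sentences and labels below are invented for illustration; the mechanics (word counts per class, Laplace smoothing, log-probability comparison) are the standard multinomial Naïve Bayes recipe:

```python
import math
from collections import Counter

# Tiny hand-labelled corpus (hypothetical): subjective vs objective sentences.
TRAIN = [
    ("i love this beautiful phone", "subj"),
    ("what a terrible boring movie", "subj"),
    ("the phone weighs 140 grams", "obj"),
    ("the movie runs 120 minutes", "obj"),
]

counts = {"subj": Counter(), "obj": Counter()}  # word counts per class
totals = {"subj": 0, "obj": 0}                   # total words per class
for text, label in TRAIN:
    for w in text.split():
        counts[label][w] += 1
        totals[label] += 1
vocab = {w for c in counts.values() for w in c}

def classify(text):
    """Pick the class maximizing log P(class) + sum log P(word|class),
    with Laplace (add-one) smoothing for unseen words."""
    scores = {}
    for label in counts:
        score = math.log(0.5)  # uniform prior: 2 sentences per class
        for w in text.split():
            score += math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("a beautiful boring phone"))    # subj
print(classify("the phone runs 140 minutes"))  # obj
```

Even this toy version shows why the method works: subjective sentences share emotionally loaded words ("beautiful", "boring"), while objective ones share measurement vocabulary, and the smoothed likelihoods separate the two.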
On (28), researchers built a subjectivity detector. Instead of evaluating each sentence alone, they take coherence into consideration, meaning that a sentence tends to have the same orientation as its nearby sentences. In order to evaluate pairs of sentences, they use a minimum-cuts graph-based formulation and create an extra factor that affects the basic subjectivity detector.
In general, finding subjective phrases before proceeding with further analysis can enhance the results by keeping the polarity classifier from considering irrelevant text.
3.3.6 Identify Opinion Holder
The opinion holder is the source of the opinion expressed in a specific passage of the text. Even though in product reviews this is not essential (as the writer is also the opinion holder), it becomes important in news reviews that may contain the opinions of different people on the same topic or that may describe the sayings of another person, e.g. a political figure.
The main methodology for this task uses graphical models and extraction-pattern learning in order to identify syntactic and semantic patterns (41). Syntactic path information between the Opinion Word and candidate Opinion Holder words, as well as the distance between them, has also been used to identify possible relations; the difference from the method above is the use of a Maximum Entropy model to rank the candidates.
(27) combines a list of specific heuristics with syntactic rules and the position of the candidate word (in the sentence and relative to the opinion word) in order to identify the opinion holder in Thai texts. The heuristic rules include:
• It must associate and collocate with identified opinion operators in a certain pattern
• It usually occurs at the beginning of the sentence, or near the beginning or end of a quotation
• It frequently co-occurs with the topic words and entities in the query
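A crude English-language analogue of the first two heuristics (collocation with an opinion operator, position before it) can be sketched with a regular expression. The operator list and the capitalized-phrase pattern are assumptions for illustration, far simpler than the syntactic rules the paper actually uses:

```python
import re

# Hypothetical list of "opinion operators" (reporting verbs).
OPERATORS = r"(said|says|claims|argues|believes|noted)"

def find_opinion_holder(sentence):
    """Heuristic sketch: the capitalized phrase immediately before a
    reporting verb is taken as the opinion holder, if one exists."""
    m = re.search(r"([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\s+" + OPERATORS, sentence)
    return m.group(1) if m else None

print(find_opinion_holder("John Smith said the new policy is a disaster."))
# John Smith
print(find_opinion_holder("The budget was approved yesterday."))
# None (no opinion operator present)
```

Real systems replace the capitalized-phrase pattern with named-entity recognition and rank competing candidates, but the underlying intuition is the same: holders collocate with opinion operators in predictable positions.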
3.3.7 Sentiment Classification
The Sentiment Classification stage is the main processing stage. Here, the lexicon developed in the previous stage is used to classify the document. If the analysis is done at document level, the presence of opinion words is used as the indicator; if it is done at feature level, opinion words must first be joined with feature words. In both types of analysis, besides polarity, other attributes of opinion words can also be taken into consideration, such as subjectivity and strength.
Methods used for sentiment classification can be categorized, depending on the degree to which the classifier is trained beforehand, into supervised, semi-supervised and unsupervised. The first make extensive use of a training corpus in order to "learn" the domain behaviour, i.e. the opinion words, the feature words and the relations between them. They are more accurate, but also domain-dependent, and they require a lot of human work to prepare the training set. The latter rely on syntactic and semantic rules in order to classify the document.
The most common classifiers used in supervised methods include Naïve Bayes, Maximum Entropy and Support Vector Machines (SVM), with the latter proven the most effective (35), reaching a very good level of accuracy (92%). However, the main problems with classifiers are that they are single-domain - the same word may have a different meaning in different domains - and that they require many man-hours to tag and prepare the training set. Read has found that classifiers can also be temporally dependent (42). In recent years, the availability of ready-made data sets has increased the usage of supervised methods. The domain-dependency issue can be addressed by intuitions such as finding words that have the same polarity in both domains and treating those as appropriate features.
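To make the supervised setting concrete, here is a minimal multinomial Naïve Bayes classifier over bag-of-words counts, written from scratch; the tiny training corpus and its labels are invented for illustration, whereas a real system would train on thousands of tagged reviews:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes sentiment classifier.

    docs: list of (token_list, label) pairs.
    Returns (log_priors, per-class word log-likelihoods, vocabulary).
    """
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_priors = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Add-one (Laplace) smoothing so unseen words do not zero out a class.
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_priors, log_like, vocab

def classify_nb(tokens, log_priors, log_like, vocab):
    best, best_score = None, float("-inf")
    for c in log_priors:
        score = log_priors[c] + sum(log_like[c][w] for w in tokens if w in vocab)
        if score > best_score:
            best, best_score = c, score
    return best

train = [
    (["great", "camera", "love", "it"], "pos"),
    (["excellent", "screen", "great", "battery"], "pos"),
    (["poor", "quality", "hate", "it"], "neg"),
    (["terrible", "battery", "poor", "screen"], "neg"),
]
model = train_nb(train)
print(classify_nb(["great", "battery"], *model))  # → pos
```

The Laplace smoothing inside `train_nb` is what keeps a single out-of-class word from driving a class probability to zero.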
Kamps and Marx (2002) used WordNet relationships to produce values for three parameters for each candidate opinion word in the text: evaluative (good-bad), activity (active-passive) and potency (strong-weak). The Minimal Path Length (MPL) is measured between seed words and the words in question.
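The measure can be illustrated with a breadth-first search over a synonymy graph. The graph below is a toy stand-in for WordNet (its edges are invented), and the evaluative score is a normalized variant of the good-bad comparison rather than Kamps and Marx's exact formula:

```python
from collections import deque

def min_path_length(graph, start, goal):
    """BFS shortest-path length in an undirected synonymy graph (-1 if unreachable)."""
    if start == goal:
        return 0
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        word, dist = queue.popleft()
        for nxt in graph.get(word, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

# Toy synonymy graph standing in for WordNet; the edges are invented.
graph = {
    "good": ["fine", "decent"],
    "fine": ["good", "ok"],
    "decent": ["good", "ok"],
    "ok": ["fine", "decent", "mediocre"],
    "mediocre": ["ok", "bad"],
    "bad": ["mediocre"],
}

def eva(word):
    """Evaluative score: relative closeness to 'good' versus 'bad' (in [-1, 1])."""
    d_good = min_path_length(graph, word, "good")
    d_bad = min_path_length(graph, word, "bad")
    return (d_bad - d_good) / (d_good + d_bad)

print(eva("fine"))
```

A word closer to "good" than to "bad" scores positive; the activity and potency axes would use the same machinery with different seed pairs.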
Other methods include techniques such as Naïve Bayes, Maximum Entropy (ME) and SVM. These can be combined with Natural Language Processing techniques, which require a pre-tagged lexicon and a set of grammatical or syntactic rules.
In (27), it is noticed that the polarity of a paragraph strongly affects the polarity of the sentences in it. Thus, after estimating the polarity of every sentence, the method then estimates the polarity of each paragraph in order to produce a better classifier. In general, the context of a sentence can affect the determination of its polarity via a related variable.
In (7), a First Sentence feature is used, on the assumption that the first sentence usually states the overall opinion of the author. This variable can influence the polarity decided for phrases initially thought neutral.
(26)
An interesting notice was made on were researchers found that PROS reviews
results are more accurate than CONS reviews because on the first ones reviewers use
more explicit terms.
In (19), researchers introduced relaxation labelling for finding the semantic orientation of words in context. Relaxation labelling is a well-known iterative procedure. The variables the researchers used were:
• a set of objects
• a set of labels
The algorithm iterates over the objects, re-evaluating each one with the updated neighbourhood scores. The process stops when the scores have remained stable for several cycles. Their research led to the development of the Opinion Mining tool OPINE.
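The iterative procedure can be sketched generically: label probabilities for each object are repeatedly re-weighted by the support received from neighbouring labels until they settle. This is a schematic illustration, not the actual OPINE algorithm; the support function and the toy two-word example are invented:

```python
def relaxation_labelling(objects, labels, neighbours, support, init, rounds=20):
    """Iteratively update label probabilities using neighbourhood support.

    objects: items to label; labels: candidate labels;
    neighbours: object -> list of neighbouring objects;
    support(la, lb): compatibility of label la with a neighbour labelled lb
    (kept above -1 here so the update factor stays positive);
    init: object -> {label: probability}.
    """
    probs = {o: dict(init[o]) for o in objects}
    for _ in range(rounds):
        new = {}
        for o in objects:
            scores = {}
            for la in labels:
                # Support for label la from the current neighbour beliefs.
                s = sum(probs[n][lb] * support(la, lb)
                        for n in neighbours[o] for lb in labels)
                scores[la] = probs[o][la] * (1 + s)
            z = sum(scores.values())
            new[o] = {la: v / z for la, v in scores.items()}
        probs = new
    return probs

# Toy example: two adjacent words whose neighbours tend to share polarity.
objs = ["great", "reliable"]
nbrs = {"great": ["reliable"], "reliable": ["great"]}
supp = lambda a, b: 1.0 if a == b else -0.5
init = {"great": {"pos": 0.9, "neg": 0.1}, "reliable": {"pos": 0.5, "neg": 0.5}}
result = relaxation_labelling(objs, ["pos", "neg"], nbrs, supp, init)
print(round(result["reliable"]["pos"], 3))
```

The initially undecided word ("reliable") is pulled toward the polarity of its confident neighbour, which is exactly the context effect the technique exploits.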
Unsupervised methods, in turn, rely on the quality of the lexicon created. The methods for creating such a lexicon were described in chapter 5.1.
If the analysis is done at feature level, then feature words must first be spotted and relations established between opinion words and specific feature words. A feature can be a "part-of" the topic (such as the screen of a camera) or a property of a part of the topic (such as the size of a camera) (Hu and Liu, 2004).
In (21), researchers used Turney's PMI score and Osgood's semantic values, as well as syntactic features - such as OW-FW proximity - as parameters for an SVM. This obtained better scores than an SVM with a "bag-of-words" approach.
Relations between opinion words and feature words can be established for words belonging to the same phrase. However, some researchers search for pairs within an n-phrase distance in order to enhance their results. For example, one approach to tackling the coreference issue analyzed below is to search the preceding phrases for an FW if none exists in the phrase containing the OW. A pair can thus be established within a sentence or within a window of sentences (7).
In general, to obtain pairs, the adjacent-adjectives method is used, based on the observation that a product feature and its corresponding opinion word usually co-occur within a specific distance. In this way, Opinion Units (7) are established.
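The proximity heuristic can be sketched as follows; the feature and opinion lexicons and the window size are toy choices for illustration:

```python
def extract_opinion_units(tokens, feature_words, opinion_words, window=3):
    """Pair each opinion word with the nearest feature word within a token window."""
    features = [i for i, t in enumerate(tokens) if t in feature_words]
    pairs = []
    for i, t in enumerate(tokens):
        if t not in opinion_words:
            continue
        in_range = [f for f in features if abs(f - i) <= window]
        if in_range:
            # Keep the closest feature word as the pair partner.
            nearest = min(in_range, key=lambda f: abs(f - i))
            pairs.append((tokens[nearest], t))
    return pairs

tokens = "the screen is bright but the battery is poor".split()
print(extract_opinion_units(tokens, {"screen", "battery"}, {"bright", "poor"}))
# → [('screen', 'bright'), ('battery', 'poor')]
```

Each returned tuple is one Opinion Unit; widening `window` trades precision for recall.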
In (8), the researcher based his analysis on grammatical rules in order to track OW-FW pairs, recognizing explicit and implicit pairs. Explicit pairs were those co-existing in the same sentence; however, because many OWs and FWs may exist in one sentence, he used further grammatical analysis to establish the pairs reliably. With the help of the Stanford Parser, he used the parameters below:
• POS analysis
• relation sequence
• negation words
By keeping the most frequent such sequences, he established a dependency template syntax that he later used to establish pairs.
In (6), researchers created a "background words" list - for example, in texts discussing the iPhone, the words "iPhone" and "Apple". By isolating those words from the analysis, they allow their classifier to concentrate on the words that really affect the polarity of the text.
Methods can be divided into single-layer and dual-layer. As an example, a dual-layer method was proposed in (7), in which the first layer classifies a string as an Opinion Phrase; if this test succeeds, the OW-FW pair is then validated on a second layer.
Such a similarity-based approach was also introduced in (27), in which tested expressions are compared syntactically with annotated template expressions. When similarities are found, parts of the test expression are labelled according to the pre-annotations on the template ones.
A similar intuition is used in (34) to find the polarity of a sentence: within a given topic, objective sentences will be more similar in structure to other objective sentences than to subjective ones.
In (16), researchers created trees with paths between a topic word and a sentiment word for hotel reviews. They assumed that there is a conceptual relation between a topic and a sentiment if they co-occur within a certain distance threshold. Once a relation is established, they check for a negation word on the path.
Negation words (e.g. no, not, but) should take part in the analysis, as they modify the polarity of an OW-FW pair; they are considered valence shifters. One can search for such modifiers on the path between the opinion word and the feature word, or alternatively in an n-window of phrases around the opinion word (as done in (9)).
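A minimal form of the valence-shifter check might look like this; scanning the linear token span between the pair is a simplifying assumption made here, where a real system would search a parse path or a phrase window:

```python
def pair_polarity(tokens, ow_index, fw_index, lexicon_polarity,
                  negations=frozenset({"no", "not", "never"})):
    """Flip the lexicon polarity of an OW-FW pair if a negation word lies between them."""
    lo, hi = sorted((ow_index, fw_index))
    polarity = lexicon_polarity
    for t in tokens[lo:hi + 1]:
        if t in negations:
            polarity = -polarity  # each valence shifter flips the sign
    return polarity

tokens = "the room was not clean".split()
# "clean" (index 4) is the opinion word, "room" (index 1) the feature word;
# the lexicon says clean = +1, but "not" on the span flips it.
print(pair_polarity(tokens, 4, 1, +1))  # → -1
```

Flipping once per shifter also handles double negation, which cancels out.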
(18) presents the concept of Class of Feature: by identifying three types of "influential keywords" for each theme - principal, complementary and auxiliary - a different level of influence is applied to each word.
Gradability also affects the strength of the opinion. Gradability is "..the semantic property that enables a word to participate in comparative constructs and to accept modifying expressions that act as intensifiers or diminishers" (17). Its presence alone is a good indicator of subjectivity. Hatzivassiloglou and Wiebe (2000) used two indicators for spotting gradability: the presence of modifiers indexed manually in a list (little, very, somewhat etc.) and the presence of inflected forms of adjectives. A log-linear statistical model used those two indicators to produce a final decision about a document's gradability.
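The two indicators can be sketched as simple counters; the modifier list is abridged and the suffix test is a crude stand-in for the inflection check (in practice a POS tagger would restrict it to adjectives):

```python
import re

# Manually indexed intensity modifiers, as in the first indicator above.
MODIFIERS = {"little", "very", "somewhat", "rather", "extremely"}

def gradability_cues(tokens):
    """Count the two gradability cues in a token list: known modifiers and
    comparative/superlative-looking adjective forms. The suffix check
    over-matches words like 'clever', which a POS tagger would filter out.
    """
    modifier_hits = sum(1 for t in tokens if t in MODIFIERS)
    inflected_hits = sum(1 for t in tokens
                         if re.search(r"(er|est)$", t) and len(t) > 4)
    return modifier_hits, inflected_hits

print(gradability_cues("the rooms were very clean and bigger than expected".split()))
```

A log-linear model would then combine the two counts into a single gradability decision.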
Another experiment measured the frequency of word groups with specific elements. Results showed that a "bag-of-words" representation outperformed each of the individual feature sets created from Appraisal Theory; however, combining "bag-of-words" with Appraisal Theory produced an accuracy of 90.2%.
In conclusion, the parameters taken into consideration in Opinion Mining analysis are:
• part-of-speech properties
• syntactic patterns
• modifiers (negation)
3.3.8 Visualization / Summarization
Visualization is extremely important for all tools whose objective is to present information. The information should be aggregated and presented to the end user in a way that allows him to draw conclusions: abundant but poorly designed data representation has the same effect as too little data - a poorly informed user. Common forms of presenting the results include:
• A textual summary consisting of the most important clauses (20), (40). The task of this technique is to create a short text that comprehensively summarizes a document. Although it is used primarily in single-document summarization, it can also be applied to multi-document summarization (4) (Opinion Mining is a multi-document analysis). In the latter case, the output of the system can be the sentiment of a document together with its title (Ku et al.); the title is considered a good summary of a text.
• Simple opinion sentences that describe the polarity for each feature. This is useful when a feature-level analysis is done.
• Scaling (binary or multi-scale). In (30), researchers tackle the issue of presenting results not on a two-fold scale (bad/good) but on a multi-step scale (e.g. a five-star rating system). Experimenting with humans, they concluded that people handle scales of up to four grades well; on bigger scales, only 5% of the documents were placed on the most negative or most positive step. They therefore defined their problem as four-class and used three different methods: one-vs-all, regression and metric labeling.
"...measurable economic impact.." (4). Hence, it is useful for a summarization system to include both kinds of data.
3.4 CHALLENGES - LIMITATIONS
The research done so far, as described above, has produced methods that achieve significant success rates (up to 90%). However, the characteristics and polymorphism of human expression cannot easily be predicted and analyzed by a machine. For example, sarcastic phrases are very hard for an algorithm to analyze correctly. Researchers have tried to minimize the errors produced by this linguistic phenomenon by taking into consideration the polarity of context phrases; of course, this can be only partially successful.
Various intuitions are also used to tackle the coreference issue. In (38), candidate feature words are identified based on the observation that, when the focus shifts from one feature to another, the new feature is often expressed using a definite noun phrase at the beginning of the next sentence.
In (11), two different notions are introduced. The first is sentiment consistency: humans tend to express the same sentiment in two consecutive phrases (if a different object is not clearly stated). Second, only OW-FW pairs that are known to be linked are used; for example, the opinion word cheap cannot be linked with the feature picture quality, even if coreference resolution algorithms suggest so. The researchers used supervised machine learning, in which a pairwise function predicts whether two noun phrases are coreferent: "Subsequently, when making coreference resolution decisions on unseen documents, the learnt pairwise noun phrase coreference classifier is run, followed by a clustering step to produce the final clusters (coreference chains) of coreferent noun phrases."
In (16), the problem was addressed with the help of the MARS system.
• Usage of synonyms: in (26), researchers use the first two WordNet entries for each feature to produce a list of synonyms for each word in their lexicon. Other corresponding lexicons can also be used to extend the initial one.
3.5 SPECIAL ISSUES
3.5.1 Use of Ontologies
Web 2.0 is here, Web 3.0 is coming. In the new era of Internet evolution, the objective is for the data on the Web to carry much more semantic information, making it possible for machines to understand more of its qualities than they do today. A simple example would be a search engine which, when queried with a phrase, returns not just the texts that contain that phrase, but all the links that seem relevant to it (Wolfram Alpha is a meta-search engine attempting to do that).
Among other tasks, ontologies can assist feature extraction, as they create new links - not present in the initial lexicon - through ontological relations.
In (22), researchers use RDF for interoperability and integration with the rest of the MUSING Business Intelligence Suite. MUSING models subjective information, such as reputation and reliability, in a Business Intelligence Ontology. The researchers take advantage of this ontology in the context of developing a suite that creates an accurate picture of a business entity over time by mining relevant opinions. The results of their analysis are tagged according to the MUSING ontology, so that its relations can be used in further analyzing and summarizing them.
In (37), researchers develop a marketing tool that will allow users to understand their market better. Their approach is based on Semantic Web technologies, using SIOC (Semantically-Interlinked Online Communities) - a vocabulary for representing discussions and posts - and linking their data with public semantically tagged datasets such as DBpedia. The novelty of that work is the definition of an Opinion Mining Ontology, which can act as a universal intermediary between the domain under analysis and the Opinion Mining process itself.
Further work in that field has emerged (44). Sentic Computing is a new paradigm, an evolution of traditional Opinion Mining techniques that uses common-sense reasoning tools and domain-specific ontologies instead of statistical learning models. This approach addresses a disadvantage of traditional OM tools: that they cannot effectively handle the fact that opinions are expressed in a context- and domain-dependent way. Sentic Computing applications include AffectiveSpace, the Hourglass of Emotions, the Human Emotion Ontology, OMCSentics and SenticNet.
3.5.2 Opinion Mining and the Social Web
The Social Web has introduced a new challenge for Opinion Mining. The production of information by 2.0 applications is huge, and Twitter has emerged as the "news media of the new era". Data are produced in real time at such a pace that traditional search engines (even behemoths like Google) face issues indexing them. On the other hand, as more people use that stream to publish their opinions, it becomes very important for Opinion Mining tools to be able to exploit it, especially considering that it is a free-of-charge dataset.
One such analysis (12) took into consideration only messages that explicitly contained a topic keyword, establishing a simple ratio between negative and positive messages: a negative message was any message containing a negative word (as established by the OpinionFinder subjectivity lexicon), and a positive one any message containing a positive word. Even though the researchers admit that the method is a baseline one with a high error rate, they assume that the large number of sentiments will smooth out the noise. Moreover, they invoke the observation of Hopkins and King (2010) that text-analysis techniques can be inaccurate when the objective is to assess population proportions.
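The baseline ratio just described can be sketched in a few lines; the tweets and the mini-lexicons below are invented stand-ins for a real stream and for the OpinionFinder lexicon:

```python
def sentiment_ratio(messages, topic, positive_lexicon, negative_lexicon):
    """Baseline sentiment score: ratio of positive to negative topic messages.

    Keep only messages containing the topic keyword; a message counts as
    positive/negative if it contains any word from the respective lexicon
    (a single message can count toward both).
    """
    pos = neg = 0
    for msg in messages:
        words = set(msg.lower().split())
        if topic not in words:
            continue
        if words & positive_lexicon:
            pos += 1
        if words & negative_lexicon:
            neg += 1
    return pos / neg if neg else float("inf")

tweets = [
    "the new phone camera is awesome",
    "phone battery life is awful",
    "phone looks great but feels awful",
    "totally unrelated message",
]
print(sentiment_ratio(tweets, "phone", {"awesome", "great"}, {"awful", "bad"}))
```

Tracking this ratio day by day yields the kind of trend curve that was compared against polls.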
The results showed that analyzing tweets using this method, even though not a safe one, can serve as an indicator of public-opinion trends.
In general, the analysis of that stream has its own challenges. Indexing real-time data is a challenge on which real discussion and progress is being made today (e.g. the development of applications on the Hadoop platform); however, this will not be discussed further, as it is out of the scope of the present work. Concerning linguistic analysis, one aspect that should be taken into consideration is the short messages found on new Web services (such as YouTube, Twitter etc.), as noted in (1). Lack of concern for grammar and spelling, abbreviations (e.g. lol) and emoticons are characteristics of that kind of content. As such data frequently disobey general linguistic rules, the development of a special lexicon and special rules may be required.
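A special lexicon for such content can be as simple as a normalization table applied before the main analysis; the abbreviation and emoticon entries below are a small invented sample:

```python
# Toy abbreviation and emoticon lexicons for informal Web text; entries invented.
ABBREVIATIONS = {"lol": "laughing", "gr8": "great", "u": "you"}
EMOTICONS = {":)": "good", ":(": "bad", ":D": "good"}

def normalize_informal(text):
    """Map abbreviations and emoticons to lexicon-friendly words before analysis."""
    out = []
    for token in text.split():
        key = token.lower()
        # Emoticons are matched verbatim, abbreviations case-insensitively.
        out.append(EMOTICONS.get(token, ABBREVIATIONS.get(key, key)))
    return " ".join(out)

print(normalize_informal("gr8 hotel lol :)"))  # → "great hotel laughing good"
```

After this pass, the standard opinion lexicons can match tokens that would otherwise be missed entirely.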
3.6 CONCLUSIONS
What was made clear in the last two chapters is the evolution of the Web, among other things, into an invaluable resource of information. Additionally, its very different character compared to traditional sources (virtually real-time and constantly updated, very large quantities, direct feedback from customers, low cost or even free) makes it unwise for a company not to listen to it. Management should be able to "read" that stream of information when evaluating the business environment and taking decisions.
Even though all the methods used have limitations, the research seems promising. The main effort is directed toward two objectives: increasing the success rates of the analysis - i.e. correctly identifying the sentiment orientation of a text - and minimizing the effort of performing the analysis - e.g. by developing a domain-independent classifier that does not rely on a domain-specific learning procedure prior to the analysis. In specific efforts, the experimental results are impressive (over 90%), even though usually on a domain-specific and controlled corpus. Moreover, as more lexicon resources become available and techniques from the AI and Semantic era (ontologies) come into broad use, it is very possible that current limitations will soon be overcome, eventually creating solutions whose error is statistically insignificant.
All the above are described in the context of a real business case of a travel agency working on the SAP® Business One platform. However, the framework is designed to be globally applicable, independent of the underlying platform used and of the domain in which it operates; even more, it is designed to adapt easily to those different environmental parameters. In the next paragraphs, the business case is described, followed by the description of the framework, with emphasis on the Opinion Mining module.
4.2 BUSINESS CASE DESCRIPTION
The business domain of our case study is travel. A travel agency operates in a competitive environment with many active players. Moreover, like every Greek company, it faces a declining market due to the financial crisis. Many Internet platforms have also been built and operate, creating a globalized market with strong competitors from abroad.
In that environment, information and knowledge are vital for the company. The prices it offers should be competitive, so it should promote any special price offers from its suppliers. Moreover, as its propositions are based on the uniqueness of the offer and its experience in creating successful trips, it should know as many details as possible about all the hotel services it proposes. Among the roles involved in these processes, the case study focuses on:
1. the Sales Agent, as a critical part of the company that could clearly benefit from enhanced information.
4.3 FRAMEWORK DESCRIPTION
4.3.1 General Presentation
The framework proposed for a Business Intelligence system covers the requirements of a BPMS. The value chain is at the centre of its operation; an initial implementation step is the identification of the business processes and of the KPIs with which to monitor them. On the other hand, a very important role in the system is played by the exploitation of unstructured Web resources, done using two methods described below. The system is supported by specific ontologies intended to enhance the capabilities of visualization and of querying the engine, as well as to make it more independent of the domain in which it operates each time.
4.3.2 Module-by-Module Analysis
ETL - Data Warehouse: A data warehouse architecture is used in order to integrate all sources in a single place and ensure a single source of truth. The exact architecture of the database and the details of the ETL methods are out of the scope of the present work and will not be discussed further.
As SAP® Business One covers the implementation of all the business processes (including CRM), all data are stored in a single relational database, which takes the role of a data warehouse. No further integration or ETL techniques are required, as the ERP data sources are virtually the only ones. The Web-ETL module and the Opinion Mining techniques instead take the primary role of integrating data from the "outer world", which shake that balance with their lack of structure but also with their constantly growing importance.
Web-ETL: This module handles data sourced from the Internet. Being mainly unstructured, they must be integrated into the structured form of a relational database. The sources for this module are:
• pre-designated URLs, which are crawled regularly in order to draw data from them. They are stored in the Data Sources Repository and are mainly used by the Opinion Mining module. This operation is discussed extensively in the next sub-chapter.
• Web Service APIs. Many companies now maintain active pages on Facebook through which they promote specific services or products. As this is a major channel of communication with customers, a company can drill information using its API. Those sources are mainly used by the Web Mining module; as this is out of the scope of the present work, it will not be discussed further.
Those tools primarily address the operational rather than the strategic level of decision making.
BPMS: Business processes are recognized and tied to the operation of the Business Intelligence system. For each of them, KPIs are stored in the KPI Repository and are monitored through dashboards and an alerting system as feedback to the user. If the user observes deviations from the desired performance, he can then calibrate the processes. The system tries to simulate a closed-loop BI system, in which the system's performance is fed back into it as a data source for further analysis.
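The deviation-alerting behaviour can be sketched as a simple check over the KPI repository; the repository shape, the KPI names and the tolerance value below are hypothetical illustrations:

```python
def check_kpis(kpi_targets, readings, tolerance=0.1):
    """Return the KPIs whose latest reading deviates from target beyond tolerance.

    kpi_targets: KPI name -> target value (assumed non-zero);
    readings: KPI name -> latest measured value.
    """
    alerts = {}
    for name, target in kpi_targets.items():
        value = readings.get(name)
        if value is None:
            continue  # no fresh reading, nothing to flag
        if abs(value - target) / abs(target) > tolerance:
            alerts[name] = (target, value)
    return alerts

targets = {"bookings_per_week": 100, "avg_margin_pct": 12.0}
latest = {"bookings_per_week": 82, "avg_margin_pct": 12.5}
print(check_kpis(targets, latest))  # only bookings deviates by more than 10%
```

In the closed-loop picture, the alerts produced here would themselves be stored and analyzed as another data source.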
Analysis Layer: In the analysis layer, data mining techniques are used. Further analysis of the exact techniques used is out of scope. What has to be mentioned here regarding our design is that the components of this layer should be delivered easily to the user's query engine, so that he can apply them to the KPIs of the system without technical knowledge.
Semantic Layer: A semantic layer is applied over the operation of the whole system. The scope of this addition is two-fold:
• It enables users to set up and "describe" the basic components of the system (e.g. KPIs, business processes, keywords etc.) without having to know a programming language. This is the main reason for keeping a Business Intelligence Ontology.
At the core of the system lies the Business Intelligence Ontology. This describes all the concepts on which the BPMS is based, which are:
{Business Process}
    {has name}
    {has owner}
    {has trigger}
{Domain Class}
    {has name}
    {relates with business process}
    {relates with domain class}
    {has database ind}
{Database Ind}
    {has name}
    {has database position}
In a few words, Database Ind is the connection point between the database and the conceptual world of the domain. The whole visualization and query engine is structured around the notions of business processes and domain classes; the former are measured with KPIs. All these ontologies are defined by the user for the specific domain in which the system is to operate.
The view presents information for the URLs in the Data Sources Repository. It also uses the Keywords Repository in order to drill data through the Twitter API.
4.4 DESCRIPTION OF OPINION MINING SUB-MODULE
4.4.1 Introduction
Opinion Mining is an important module in our proposal. As discussed in previous chapters, unstructured data from the "outside world" are taking an ever greater part in the decision-making process. The proposed framework takes advantage of those sources in two different ways:
• Showing streams of data to users through the user interface. This visualization does not aim at providing integrated, processed knowledge, but only a quick view of the most current news.
As Opinion Mining is a field of real interest for the years to come, a raw prototype has been developed as a premature stage of an operating solution; its current status is described below. The methods used are, in general, raw compared to the latest evolutions described in Chapter 3, but they cover the objective of the current thesis, which was to survey the specific field, test the notions at a high level and identify the issues that would have to be solved in a real product. The example below describes the application of the prototype to a travel agency. However, the same prototype is designed to operate in other domains too, by changing specific system parameters, namely:
• sources
• lexicons
The analysis output may also need adjustments in order to fit the requirements of the process it supports in each case.
(Note: in the paragraphs that follow, improvement plans scheduled for the next phases of development are given at some points, in italics.)
4.4.2 Business Case
The basic task of a travel agent (salesman) is to offer his potential customers travel-package offers attractive enough for them to buy. Those packages depend, among other things, on:
• Customer needs. For example, a customer travelling with his children would prefer a family hotel to one suitable for honeymooners.
• The quality of hotel services in specific critical areas, such as rooms and location.
• Offers that make the package more attractive, for example free nights.
• Other conditions that may affect the decision. For example, an increase in the dollar/euro rate may make rooms in hotels outside the Euro zone expensive.
During the sale process, the agent uses data from the ERP database: a quality rating set internally for each hotel, information such as the facilities each hotel offers and the size of its rooms, and comments from the customer survey conducted each time customers return from their trip. Data about the prices and offers of each hotel are also available to the user.
The structure, operation and details of the Sentiment Analysis module are described below.
The process for the lexicon development is as follows. Initially, feature words are identified for each of the four areas of hotel services; their synonyms are searched for and added to the repository. (Improvement note: as this is currently done manually, an integration with SentiWordNet is planned.) For each of the identified feature words, opinion words are identified and connected; the connections are identified by manually scanning a training corpus. (Improvement note: instead of manual scanning, a supervised learning method is planned to be used on a training corpus for the development of the lexicon. Glosses theory and integration with SentiWordNet will further extend the lexicon.)
The analysis mainly uses syntactic heuristics in order to identify patterns that reveal some kind of polarity, positive or negative, based on the two manually constructed lexicons. The results are visualized and summarized over the four areas mentioned above.
Subjectivity analysis, even though proven to improve the precision of the results when applied before sentiment analysis, has not been used in the present work.
Four successive phases take place as soon as the text is "delivered" to the mechanism for processing, as shown in the figure below:
• Text pre-processing, where the collected corpus is divided into phrases and into words. A list of delimiters is used to identify phrases. "Phrase" is also a property of each word, used at a later stage to identify pairs of opinion words and feature words. The words are the raw material for the analysis at the later stage.
• Define subjectivity. In this phase, the feature words and opinion words included in the two lexicons are recognized in the corpus. As the same opinion word may have a different polarity for different feature words, pairs must be identified; the "phrase" property helps here. A simple intuition is used: an OW and an FW co-existing in the same phrase are considered a pair, if this is allowed by the lexicons developed. (Improvement note: Extended Conjunction Theory will be used to identify pairs of words and their polarity more effectively. This is defined as the extensions made over the years to the initial conjunction theory (Hatzivassiloglou, McKeown) studying the polarity transfer between words on the two sides of a conjunction.) As soon as a pair is recognized, the polarity for the specific pair is drawn from the database.
To tackle the coreference issue, the simple intuition used is that, in order to change the feature word to which the text refers, the new FW must be mentioned explicitly. Thus, opinion words without an explicit link to an FW are connected implicitly with the FWs of previous phrases (if those pairs are included in the lexicon).
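The phases just described (phrase splitting, lexicon lookup, same-phrase pairing, polarity retrieval and the carry-forward coreference intuition) can be sketched end to end. The lexicons here are tiny invented samples, whereas the real prototype draws them from its database:

```python
import re

FEATURE_LEXICON = {"room", "location", "breakfast"}           # toy feature words
PAIR_POLARITY = {("room", "clean"): +1, ("room", "small"): -1,
                 ("location", "central"): +1, ("breakfast", "cold"): -1}

def analyse_review(text):
    """Sketch of the prototype's phases: split the text into phrases, find
    FW/OW pairs inside each phrase, look up the pair polarity, and carry the
    last feature word forward when a phrase has opinion words but no FW."""
    opinion_words = {ow for _, ow in PAIR_POLARITY}
    results, last_fw = [], None
    for phrase in re.split(r"[.,;!?]", text.lower()):
        tokens = phrase.split()
        fws = [t for t in tokens if t in FEATURE_LEXICON]
        ows = [t for t in tokens if t in opinion_words]
        fw = fws[0] if fws else last_fw   # coreference: fall back to previous FW
        if fw:
            last_fw = fw
            for ow in ows:
                if (fw, ow) in PAIR_POLARITY:
                    results.append((fw, ow, PAIR_POLARITY[(fw, ow)]))
    return results

print(analyse_review("The room was clean, but quite small. The location is central."))
```

Here "quite small" carries no explicit feature word, so the pair falls back to the previously mentioned "room", which is exactly the carry-forward intuition described above.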
Figure 4.3: Opinion Mining enhancements on the basic hotel SAP® Business One form (average rating of the hotel for each service area)
4.5 CONCLUSIONS AND FUTURE WORK
The development done so far has produced a raw prototype in which specific methods are applied in order to conduct the analysis. The modules of the system described above are either covered roughly by the current ERP platform or are still described at a conceptual level (mainly as regards the Semantic Layer). The scope of the future work is to integrate all modules into a unified platform, independent of whatever underlying platform the company uses.
Regarding the Opinion Mining sub-module, the objective was to show that unstructured data from Web sources can be integrated into the data warehouse of the company and enhance the decision process. This appears to have been achieved, as the products of the sentiment analysis feed into the sales process of the company.
Figure 4.5: Current state and future development of the Sentiment Analysis module (current state: Raw Prototype; next: First Product; then: Second Product with Semantic Layer)
However, the current state of the module cannot support production operations, as it is difficult to scale. Due to time constraints, some aspects have not yet reached the planned status: lexicons are developed and maintained manually; stemming is done using brute-force techniques that are resource-intensive; some phenomena, such as multi-word phrasing, are not covered; and subjectivity analysis has not been implemented, so its effect on the results of the analysis cannot be measured. Those steps are planned for the next stage, in which a First Product will be developed.
Figure 4.6: Current state and next steps of the Sentiment Analysis module, functions view
CHAPTER 5: CONCLUSIONS
BUSINESS INTELLIGENCE AND DATA MINING FOR ENHANCED TRAVEL SERVICES 100
5.1 CONCLUSIONS
Business Intelligence will continue evolving into an ever more important part of business operations, as more data from more sources become available at lower cost. Moreover, its integration with Business Process Management theories makes it more valuable to the core business, as it interacts directly with the value chain of a company.
The framework discussed, even though still mainly at a theoretical level, presents a solution in accordance with the need to have up-to-date data from various sources, integrated into the same data warehouse, tagged properly using an efficient metadata repository, and put to use. Those data are designed to give users the ability, through a friendly environment, to query them and produce results. Moreover, a semantic layer is added in order to extend the types of connections that can be established and, thus, the analysis that can take place. All this uses a BI Ontology as the core of the system, making its extensibility and its switchability from domain to domain easy.
Future work consists of developing those modules in order to test the ability of this
integrated system to work in a production environment and truly deliver what it is
meant to: well-organized information in a friendly environment in which users,
processes and data co-operate effectively towards the result.
ANNEX I: OPINION MINING DB SCHEMA
ANNEX II: REFERENCES
[1] Sentiment Strength Detection in Short Informal Text, Mike Thelwall (WASSA 2010).
[2] Evaluation and Extension of a Polarity Lexicon for German, Clematide & Klenner (WASSA 2010).
[3] Old Wine or Warm Beer: Target-Specific Sentiment Analysis of Adjectives, Fahrni & Klenner (WASSA 2010)
[4] Opinion Mining and Sentiment Analysis, Bing Liu and Lillian Lee
[9] OpAL: Applying Opinion Mining Techniques for the Disambiguation of Sentiment Ambiguous
[10] Determining Term Subjectivity and Term Orientation for Opinion Mining, Esuli and Sebastiani
[11] Resolving Object and Attribute Coreference in Opinion Mining, Ding and Liu
[12] From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series, O’Connor et al
[13] Opinion Mining and Strategic Decision Making: Application of Priority-Pointing Procedure in a
[14] Opinion Mining Classification Using Key Word Summarization Based on Singular Value
[15] Comment Extraction from Blog Posts and Its Applications to Opinion Mining, Kao et al
[16] Discovery of subjective evaluations of product features in hotel reviews, Pekar & Ou (Journal of Vacation Marketing)
[18] Identifying Themes in Social Media and Detecting Sentiments, Pal & Saha
[19] Extracting Product Features and Opinions from Reviews, Popescu & Etzioni
Techniques, Pang et al
[21] Sentiment analysis using support vector machines with diverse information
BUSINESS INTELLIGENCE AND DATA MINING FOR ENHANCED TRAVEL SERVICES 107
ANNEX II: REFERENCES
[24] Constructing Thai Opinion Mining Resource: A Case Study on Hotel Reviews, Haruechaiyasak et al.
[26] Opinion Observer: Analyzing and Comparing Opinions on the Web, Liu et al
[27] Incorporating Feature-based and Similarity-based Opinion Mining, Xu & Kit (CTL in NTCIR-8 MOAT)
[29] Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text, Kim & Hovy
[30] Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales, Pang & Lee
[33] Ontolexical resources for feature based opinion mining:a case-study, Oltramari et al (6th Workshop
[34] Towards Answering Opinion Questions: Separating Facts from Opinions and Identifying the
[36] Web Data Extraction Based on Partial Tree Alignment, Zhai & Liu
[37] Towards Opinion Mining Through Tracing Discussions on the Web, Softic & Hausenblas
[41] Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns, Choi et al.
[42] Using Emoticons to reduce dependency in machine learning techniques for sentiment classification, Read
[45] Sentic Computing: Merging AI and Semantic Web Techniques for Opinion Mining and Sentiment Analysis
[49] A Visual Framework for Knowledge Discovery on the Web: An Empirical Study of Business
[50] Discovering Business Intelligence Using Treemap Visualizations, Ben Shneiderman, 2006
[51] Business Intelligence 2.0: Are we there yet? Gregory S. Nelson, 2010 (in the context of 2010 SAS Forum)
[52] http://money.cnn.com/news/newsfeeds/articles/marketwire/0589146.htm
[53] http://www.b-eye-network.com/view/10275
[54] Key Issues for Business Intelligence and Performance Management Initiatives, Gartner
[55] A Comparison of Business Intelligence Strategies and Platforms, (Green Hill Analysis, 2002)
[56] Practical Considerations for Real-Time Business Intelligence, Donovan Schneider (Yahoo)
[61] Real-time Business Intelligence: Best Practices at Continental Airlines, Watson et al.
[62] The BI Watch: Real-Time to Real Value, Richard Hackathorn (DM Review, 2004)
[63] PUZZLE: a concept and prototype for linking business intelligence to business strategy, Rouibah & Ould-Ali, 2002
[64] Techniques, Process and Enterprise Solutions of Business Intelligence, Zeng et al. (2006 IEEE)
[66] Real Time Business Intelligence for the Adaptive Enterprise, Azvine et al.
[69] Enhanced Business Intelligence – Supporting Business Processes with Real-Time Business Analytics, Seufert & Schiefer
[70] The Next Generation of Business Intelligence: Operational BI, Colin White, 2006
[72] Approach to Building and Implementing Business Intelligence Systems, Interdisciplinary Journal
[74] Aligning Process Automation and Business Intelligence to Support Corporate Performance
[75] Integration of Business Intelligence based on Three-Level Ontology Services, Cao et al.
[78] Web Business Intelligence: Mining the Web for Actionable Knowledge, Srivastava & Cooley
[79] Business Intelligence Systems in the Holistic Infrastructure Development Supporting Decision-Making
[80] A Visual Framework for Knowledge Discovery on the Web: An Empirical Study of Business
[82] Natural Language Technology for Information Integration in Business Intelligence, Maynard et al.
[83] Business Process Monitoring and Alignment: An Approach based on the User Requirements