
BUSINESS INTELLIGENCE ON 2.0 ERA: A
FRAMEWORK DESCRIBED WITH EMPHASIS ON
OPINION MINING

by
Gerasimos Galatis,
ggal@ait.edu.gr

Supervisor
Dr. Sofia Tsekeridou,
sots@ait.edu.gr

A thesis submitted in partial fulfilment of the requirements for the degree

MASTER OF SCIENCE IN MANAGEMENT OF BUSINESS,
INNOVATION & TECHNOLOGY (MBIT)

ATHENS
2011
Declaration

I, Gerasimos Galatis, declare that the work presented in this thesis is original and no
part of it (including the document, the implementation code, etc.) has been copied
from other sources. Work related to this one is cited appropriately.

Gerasimos Galatis

30/03/2011

The work contained in this thesis “Business Intelligence on 2.0 Era: A Framework
Described with Emphasis on Opinion Mining” by Gerasimos Galatis has been
carried out under my supervision.

Dr. Sofia Tsekeridou

30/03/2011

Athens Information Technology


Abstract

Decision making is at the core of doing business at every organizational level,
operational or strategic. The difficult task of processing data in order to produce
information, the raw material for decisions, has proven fertile ground for the
integration of IT services with the core business environment. The methods,
techniques and applications used to provide this kind of support fall under the
“umbrella” term of Business Intelligence (which grew out of the equivalent term
Decision Support Systems). Nowadays, BI is trying to penetrate even further into the
business world, integrating with its value chain, and to exploit the new data
streams created under the notion of Web 2.0. The present thesis describes this new
environment and the trends in Business Intelligence applications that respond to it.
Attention is given to Opinion Mining techniques as the primary way to exploit
unstructured data coming from the Web and integrate them with traditional data
sources. Finally, based on the proposed framework, the development of a rough
prototype is described for a Business Intelligence system which embodies those
trends and enhances the decision-making process.

Keywords:
Business Intelligence, Opinion Mining, Business Process Management Systems,
Business Intelligence 2.0, Real-Time Analytics
Acknowledgements

I would like to thank my supervisor, Dr. Sofia Tsekeridou, for her sincere support.
EXECUTIVE SUMMARY

Decision making is a main task in managing a business at every level, from
operational to strategic. The raw material for this process is information. Since the
1960s, IT has contributed to transforming data into information and delivering it
through the technologies and techniques gathered under the umbrella term of
Decision Support Systems, later called Business Intelligence. The BI field is
constantly evolving, integrating further with business operations and adapting to
the changes that take place in its environment.

The term Business Intelligence has been used broadly since the early 1990s, after it
was coined by Howard Dresner. Various definitions are in use, but in general it can
be defined as “the process of turning data into information and then into
knowledge” (60). Business Intelligence is based on the existence of a Data Warehouse
into which data from all sources are integrated and from which all views of
information are drawn. This architecture ensures a single view of the truth. For this
type of architecture, the abbreviation BI-DW is used. The BI framework consists of
two main activities: getting data in, through which data from various sources are
integrated into the Data Warehouse using ETL processes, and getting data out, which
transforms data from the Data Warehouse into meaningful information. For the
latter, a number of methods are used, but in every case the aim is an environment in
which uncertainty is minimized and even inexperienced managers are able to take the
right decisions, by effectively transforming raw data into decisions.

Perceiving the value of a Business Intelligence application is not easy. The main
reasons are:

• the difficulty of measuring the mainly intangible effects of Business
Intelligence tools

• the time lag between the moment BI starts affecting the decision process and the
moment this becomes visible in the company’s results.

Thus, managerial support and a culture oriented towards knowledge and information
are required.

The structure of a traditional BI-DW is composed of four main tiers:

• Data Acquisition tier

• Data Warehousing tier

• Data Analysis tier

• Data Distribution tier

Nowadays, this basic architecture is affected by two major trends:

• Further integration with business processes through the application of
Business Process Management techniques in the field of Business Intelligence.
The products of this union are Business Process Management Systems, in
which the user interacts directly with the processes that produce value for a
company. A set of metrics, named KPIs, measures the performance of
processes. This feedback is used to calibrate the resulting Closed-Loop Business
Intelligence System. Real-time Data Warehousing is a critical component of
this evolution, as it decreases the latency between an event taking place and
its acknowledgement.

• Exploitation of new data sources created in the Web 2.0 environment. In this
context, every person connected to the Internet has become an editor, producing
content concerning every object and aspect of our life. The successful
recognition and integration of all those, mainly unstructured, sources of data
would affect the decision-making process. For this reason, a new field under the
term Opinion Mining has been developed and is very active.

Opinion Mining combines Natural Language Processing and Text Analytics in order
to address the problem of extracting qualitative attributes from a text. The goal of
research in this area is the development of techniques and tools able to process
large quantities of opinionated texts in order to answer questions like:

• Who is expressing the opinion?

• What are the attributes of the opinion? (negative, positive, neutral, strong,
weak)

• What does the opinion holder like and what not?

and to visualize the results in a way that is easily comprehensible for the end
user.

Opinion Mining can be done at Document Level, categorizing the whole document
as positive, negative or neutral, at Sentence Level, or at Feature Level, where each
feature of an object is graded on this scale. The basic steps of an Opinion Mining
analysis are Corpus Collection, Corpus Pre-processing, Development of
Opinion Word and Feature Word Lexicons, Subjectivity Classification,
Identification of Opinion Holder, Sentiment Classification and Visualization of
Output.
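
A few of the steps listed above (pre-processing, an opinion word lexicon and
document-level sentiment classification) can be illustrated with a minimal sketch.
The tiny lexicon, the negation rule and all function names below are assumptions
made for illustration only; they are not the method used in this thesis, and a real
system would rely on a much richer lexicon and classifier.

```python
import re

# Hypothetical toy opinion lexicon (assumption); a real system would derive one
# from lexical resources or from a labelled corpus.
OPINION_LEXICON = {
    "good": 1, "great": 1, "clean": 1, "friendly": 1,
    "bad": -1, "dirty": -1, "noisy": -1, "rude": -1,
}
NEGATIONS = {"not", "no", "never"}


def preprocess(text):
    """Corpus pre-processing: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_sentence(tokens):
    """Sum lexicon polarities, flipping a word directly preceded by a negation."""
    score = 0
    for i, token in enumerate(tokens):
        polarity = OPINION_LEXICON.get(token, 0)
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


def classify_document(text):
    """Document-level classification into positive, negative or neutral."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = sum(score_sentence(preprocess(s)) for s in sentences)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```

Calling classify_document on a short hotel review returns one label for the whole
text; sentence-level analysis would instead report the score of each sentence
separately.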

As for Lexicon Development, unsupervised or supervised machine learning methods
can be used. In either case, these methods may or may not be based on already
available lexicons such as WordNet, SentiWordNet or GI. The major approaches to
lexicon development are the Conjunction Method, PMI (Pointwise Mutual
Information), the WordNet Exploring Method and the Gloss Classification Method. For
Sentiment Classification, supervised or unsupervised learning methods can again be
used. They are based mainly on syntactic and grammatical heuristics as well as
semantic notions.
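
To make the PMI approach more concrete, the sketch below estimates pointwise mutual
information, PMI(a, b) = log2(P(a, b) / (P(a) P(b))), from document-level
co-occurrence counts, and derives a polarity score for a word as the difference of
its PMI with a positive and a negative seed word. The seed words, the toy corpus
handling and the choice to return 0.0 for unseen pairs are assumptions made for
illustration, not part of the thesis.

```python
import math


def pmi(word_a, word_b, documents):
    """Pointwise mutual information estimated from document co-occurrence."""
    n = len(documents)
    doc_sets = [set(doc.lower().split()) for doc in documents]
    count_a = sum(1 for d in doc_sets if word_a in d)
    count_b = sum(1 for d in doc_sets if word_b in d)
    count_ab = sum(1 for d in doc_sets if word_a in d and word_b in d)
    if 0 in (count_a, count_b, count_ab):
        # Assumption: unseen words or pairs are treated as uninformative.
        return 0.0
    return math.log2((count_ab / n) / ((count_a / n) * (count_b / n)))


def semantic_orientation(word, documents, pos_seed="excellent", neg_seed="poor"):
    """Positive result: the word co-occurs more with the positive seed word."""
    return pmi(word, pos_seed, documents) - pmi(word, neg_seed, documents)
```

Words with a clearly positive or negative score would be candidates for the opinion
lexicon; on a realistic corpus some smoothing of the counts would be needed instead
of the zero fallback used here.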

A number of issues arise during opinion analysis which make it difficult to achieve
100% success. Such issues are, for instance, the coreference problem, multi-word
expressions, thwarted expectations, ambiguous words whose meaning depends on the
context, and the text domain and dialect.

The use of Semantic Theory in the Opinion Mining field allows the analysis to
identify more types of connections and links between objects, as well as to resolve
issues caused by the small number of parameters that traditional analysis
recognizes.

Finally, a framework is described for a Business Process Management System in
which the notions of Business Process Management are applied. Moreover, the
Opinion Mining module has a major role in this framework, recognizing the
importance of data produced in Web 2.0 for the decision-making process. The product
of this framework is a rough prototype which sets the foundation for an integrated,
production-ready system.
TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
CHAPTER 2: BUSINESS INTELLIGENCE ON 2.0 ERA
2.1 CHAPTER INTRODUCTION
2.2 BUSINESS INTELLIGENCE: THE TERM
2.2.1 What does it mean
2.2.2 Historical Evolution
2.2.3 Value and Challenges
2.3 STRUCTURE OF A BI-DW
2.3.1 Data Acquisition Tier
2.3.2 Data Warehousing Tier
2.3.3 Data Analysis Tier
2.3.4 Data Distribution Tier
2.4 BUSINESS INTELLIGENCE: EVOLUTION NEXT
2.4.1 General Trends
2.4.2 The Green Revolution: Business Process Intelligence
2.4.2.1 Closed-loop BI
2.4.2.2 Business Performance Management
2.4.2.3 Real-time BI
2.4.3 The Purple Evolution: Business Intelligence 2.0
2.4.3.1 Environment 2.0
2.4.4 Semantic Layer
2.5 SPECIAL ISSUES
2.5.1.1 Cognitive Map Development: Weak Signs Management through PUZZLE System
2.5.2 Predictive Analytics
2.6 CONCLUSIONS – SITUATIONAL BUSINESS INTELLIGENCE
CHAPTER 3: OPINION MINING FROM RELEVANT WEB SERVICES
3.1 CURRENT SITUATION
3.2 OPINION MINING: THE TERM
3.2.1 Definition
3.2.2 Early History
3.2.3 Opinion Mining Main Categorization
3.3 OPINION MINING: THE METHOD
3.3.1 Subtasks
3.3.2 Corpus Collection
3.3.3 Corpus pre-processing
3.3.4 Lexicon Development
3.3.5 Subjectivity Classification
3.3.6 Identify Opinion Holder
3.3.7 Sentiment Classification
3.3.8 Visualization / Summarization
3.4 CHALLENGES - LIMITATIONS
3.5 SPECIAL ISSUES
3.5.1 Use of Ontologies
3.5.2 Opinion Mining and Social Web

BUSINESS INTELLIGENCE AND DATA MINING FOR ENHANCED TRAVEL SERVICES 11


3.6 CONCLUSIONS
CHAPTER 4: APPROACH OF BUSINESS INTELLIGENCE AND WEB MINING METHODS FOR ENHANCED TRAVEL SERVICES
4.1 INTRODUCTION
4.2 BUSINESS CASE DESCRIPTION
4.3 FRAMEWORK DESCRIPTION
4.3.1 General Presentation
4.3.2 Module by Module Analysis
4.4 DESCRIPTION OF OPINION MINING SUB-MODULE
4.4.1 Introduction
4.4.2 Business Case
4.4.3 General Description
4.4.4 Visualization Layer
4.5 CONCLUSIONS AND FUTURE WORK
CHAPTER 5: CONCLUSIONS
5.1 CONCLUSIONS
ANNEX I: OPINION MINING DB SCHEMA
ANNEX II: REFERENCES

LIST OF FIGURES


CHAPTER 2
2.1: From Database Management to Real-Time Business Intelligence and BI 2.0
2.2: The role of BI Systems in Decision Making, (Olszak and Ziemba, 2007)
2.3: BI Performance Management
2.4: Basic BI-DW System Structure
2.5: Evolution of BI to strategic tool, Yahoo Data Strategy Team
2.6: Gained Value reducing action time
2.7: How Web transformed on its 2.0 state
2.8: How Web 2.0 create, use and affect data handling

CHAPTER 3
3.1: Feature Words – Objects Relation
3.2: Indicative Steps on Opinion Mining Process
3.3: Index of Basic Methods for Sentiment Analysis
3.4: Evolution of Opinion Mining from “bag-of-features” to Semantic

CHAPTER 4
Figure 4.1: BPMS based on Ontologies, with Opinion Mining Sub-Module
Figure 4.2: Opinion Mining Analysis Process
Figure 4.3: Opinion Mining enhancements on Basic Hotel SAP® Business One Form
Figure 4.4: Sample of report produced with BusinessObjects® software
Figure 4.5: Current State and future development of Sentiment Analysis Module
Figure 4.6: Current State and next steps of Sentiment Analysis Module, Functions View



CHAPTER 1: INTRODUCTION

1.1 INTRODUCTION
Decision: the process of choosing one path among all alternatives, taking into
consideration all relevant parameters, in order to produce a desired result.
Information: the main element for a person who has to take a decision, describing
part or the whole of a parameter. These two notions are strongly correlated
in the field of decision-making. A person who needs to take a decision should be
aware of all the environmental parameters that affect it. The same holds in
business. A manager taking a decision should comprehend the environment (internal
and external) in which the company operates. Even though this does not guarantee
that the desired result will be accomplished, it is still the main prerequisite.

The Business Intelligence field is the evolution of the Decision Support Systems
(DSS) that have been developed since the middle of the twentieth century. This
umbrella term includes all hardware, software and methods used to support decisions
at every level of the company. As the environment in which a business operates is
extremely complicated, even more so today in a much more globalized economy,
its analysis is very difficult. The quantity of data is too large for the human
brain to analyze. The task of the first DSS was to do what a computer does well:
quick computations on raw data according to predefined algorithms, presenting the
results in a human-comprehensible way, as raw material for the decision maker.

This is still a main task for BI software, but not the only one. Today, BI also takes
a more important role in the next steps of the cognitive process of decision-making,
going well beyond just summarizing raw data. It also integrates more closely with
the core of the business, its processes, in order to affect it effectively and
directly. From being just an informer, it becomes an active performer in a living
business environment. Moreover, it introduces new ways for the business to handle
the information overload that comes with the Web 2.0 environment: information that
is valuable, yet difficult to handle. The vision of such a system is to fully support
the decision-making process, in order to help and protect managers when leading
their operations, as well as to take an active role in the business’s value chain
rather than the role of a simple informant.

In this document, the current trends and situation in the Business Intelligence
field are thoroughly investigated. Attention is given to Opinion Mining, as a group
of techniques to handle unstructured data from the Web, whose volume and importance
for a business are constantly growing. A framework will be built, according to which
a prototype will be presented for a specific business case of a Travel Agency. Even
though this prototype is not production-ready, it is still an initial effort to
describe a Business Intelligence System that combines the characteristics of a BPMS
with a Semantic Layer and a mechanism to exploit unstructured data sources.

In Chapter 2, the major notions around the term Business Intelligence are
discussed, as well as the trends that shape its latest evolution. The objective is
to create a clear picture of what Business Intelligence might look like a few years
from now.

In Chapter 3, Opinion Mining techniques are discussed, as a primary method of
processing unstructured Web data and integrating them with traditional data
sources. As the Web takes a primary role in the generation of news, comments and
information in general, the unique challenges created by the effort to collect and
process those unstructured data should be tackled efficiently.

In Chapter 4, a framework for a BI prototype is discussed for the specific case of a
Travel Agency. In this framework, the notions discussed in the preceding chapters
are embodied in order to offer the Travel Agency state-of-the-art solutions.
Examples of implementing those solutions are discussed, giving extra attention to
the Opinion Mining tool.



CHAPTER 2: BUSINESS INTELLIGENCE ON 2.0 ERA


2.1 CHAPTER INTRODUCTION
All the techniques, processes and methods used to enhance decision-making in the
corporate environment are gathered under the umbrella term of Business
Intelligence. The sections below discuss the term, the current situation and the
evolution path towards a broader, more collaborative and more responsive BI.

2.2 BUSINESS INTELLIGENCE: THE TERM
2.2.1 What does it mean
The phrase Business Intelligence first appeared in a work of Hans Peter Luhn
(1958), who defined Business Intelligence as “..the process that provides means for
selective dissemination to each of its action points on accordance with their current
requirements or desires.” (86). However, it was established as a common term,
regularly used in today’s landscape, by Howard Dresner in the early 90s.

In general, as “Business Intelligence” is an “umbrella term” (meaning that it can
include many things that serve the purpose of delivering decision-making material),
it does not have one single definition. Various definitions can be found from
different researchers or commentators. In that landscape, it seems right to define
the term broadly, as was done in (60), where Business Intelligence was defined as
“the process of turning data into information and then into knowledge.”

The BI framework consists of two main activities (57):

- Getting data in, otherwise referred to as Data Warehousing, “means moving data
from a set of source systems into an integrated data warehouse”. The flows
come from different sources, internal or external to the company,
structured or unstructured, and in various forms (“heterogeneous platforms”).
Data should be integrated and transformed into a common shape so that
further analysis can take place. The intermediate processes of this
stage belong to the group of ETL processes (Extract, Transform, Load).
To lower the load on the Data Warehouse, ETL often takes place on an ODS
(Operational Data Store), on which data are transformed before being loaded
into the Data Warehouse. Moreover, in many implementations Data Marts are
used, which are smaller data repositories that serve specific users (of a
specific department, geographic area or application). All data marts should
use the same Data Warehouse in order to ensure a “single version of truth”.
Getting data in is the most challenging aspect of BI, “requiring about 80% of
time and effort and generating more than 50% of unexpected project costs.”

- Getting data out, which is the process that really adds value for the business.
This function consists of transforming data from the data warehouse / marts into
meaningful information. It can take the form of enterprise reporting,
OLAP, querying etc.
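
The two activities can be sketched in a few lines. The example below is only an
illustration under stated assumptions: the two source extracts, the fact_sales
table, and the function names are hypothetical, and an in-memory SQLite database
stands in for the Data Warehouse. Rows from two differently shaped sources are
transformed into a common shape and loaded (getting data in), and a simple
aggregate report is then queried (getting data out).

```python
import sqlite3

# Hypothetical extracts from two source systems with different shapes.
BOOKING_SYSTEM_ROWS = [
    {"hotel": "Hotel Alpha", "date": "2011-03-01", "amount_eur": "120.50"},
    {"hotel": "Hotel Beta", "date": "2011-03-01", "amount_eur": "80.00"},
]
AGENCY_FILE_ROWS = [
    ("hotel alpha", "02/03/2011", 95.0),  # name, dd/mm/yyyy date, amount
]


def transform_booking(row):
    """Transform: normalise the name and convert the amount to a number."""
    return (row["hotel"].strip().title(), row["date"], float(row["amount_eur"]))


def transform_agency(row):
    """Transform: normalise the name and rewrite the date as yyyy-mm-dd."""
    name, date, amount = row
    day, month, year = date.split("/")
    return (name.strip().title(), f"{year}-{month}-{day}", float(amount))


def run_etl(connection):
    """Getting data in: extract, transform and load into one common table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(hotel TEXT, sale_date TEXT, amount REAL)"
    )
    rows = [transform_booking(r) for r in BOOKING_SYSTEM_ROWS]
    rows += [transform_agency(r) for r in AGENCY_FILE_ROWS]
    connection.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    connection.commit()


def revenue_per_hotel(connection):
    """Getting data out: a simple report query over the integrated data."""
    cursor = connection.execute(
        "SELECT hotel, SUM(amount) FROM fact_sales GROUP BY hotel ORDER BY hotel"
    )
    return cursor.fetchall()
```

Because both sources end up in the same table with the same naming and date
conventions, every report reads from one integrated store, which is the idea
behind the “single version of truth”.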

Business Intelligence processes can be divided into two main categories according
to the level of management that they refer to. The type of information and the BI
process producing it differ for each level, creating two distinct landscapes:

• Strategic level. The decisions are mid- and long-term. Data should be up to date
but not necessarily real-time. Accuracy and comprehensive visualization of a
complex environment are important.

• Tactical (operational) level. Data should be delivered at the right time in order
to support quick decisions for running day-to-day operations.

2.2.2 Historical Evolution

Business information delivery has evolved considerably since the middle of the
twentieth century. Even in the relatively new Business Intelligence epoch (since the
mid-90s) the landscape is constantly changing. As (85) puts it regarding the state of
Business Intelligence:
“The emergence of the data warehouse as a repository, advances in data
cleansing, increased capabilities of hardware and software, and the emergence of the
web architecture all combine to create a richer business intelligence environment
than was available previously.”

Business Intelligence systems have their roots in Decision Support Systems,
whose investigation started in the late 60s. The research effort in that period was
to “study the use of computerized quantitative models to assist in decision making
and planning” (87). The notion behind that effort was thus the application of
computerized quantitative models to decision making.

Those systems supported businesses until the late 80s, when the notion of the Data
Warehouse was introduced. The problem that the Data Warehouse tried, and managed, to
tackle was the integration of multiple data sources under a common umbrella. Until
then, the approach to handling data was application-centric: every application the
company operated had its own database, with its own data even for the same event.
By moving to a common data source for every DSS application, the systems evolved to
a data-centric approach. What they achieved was a single version of truth.

The term “Business Intelligence”, which started to be used after it was coined by
Dresner, seemed to be a paraphrase of DSS. However, a technology that was linked
with the Business Intelligence term and revolutionized data handling for businesses
was the introduction of the CUBE operator in the mid-90s. It then became possible to
conduct multi-dimensional analysis and use the OLAP techniques that now seem
essential for every BI suite (58).
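
The idea behind the CUBE operator, aggregating a measure over every combination of
the chosen dimensions, can be sketched as follows. This is a plain-Python
illustration under assumptions (the function name, the "ALL" placeholder and the
sample rows are hypothetical), not the implementation found in any particular
database or BI suite.

```python
from itertools import combinations


def cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions`.

    A dimension that is not part of a grouping is reported as "ALL", so the
    key ("ALL", "ALL") holds the grand total.
    """
    totals = {}
    for size in range(len(dimensions) + 1):
        for kept in combinations(dimensions, size):
            for row in rows:
                key = tuple(row[d] if d in kept else "ALL" for d in dimensions)
                totals[key] = totals.get(key, 0) + row[measure]
    return totals


# Hypothetical sales rows used only to exercise the sketch.
SAMPLE_ROWS = [
    {"region": "North", "year": 2010, "sales": 100},
    {"region": "North", "year": 2011, "sales": 150},
    {"region": "South", "year": 2010, "sales": 200},
]
```

Calling cube(SAMPLE_ROWS, ["region", "year"], "sales") yields totals per region and
year, per region, per year, and overall, which are the views an OLAP tool lets the
user move between when drilling down or rolling up.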

[Figure 2.1: From Database Management to Real-Time Business Intelligence and BI 2.0.
Stages shown: DSS & Database Management (70s-80s): reporting came from multiple
operational sources, application-centric. Business Intelligence & Data Warehouse
Management (90s): single version of truth, data-centric, analysis of past and
prediction of future. BPI: right-time and responsive systems supporting critical
business processes, customer-facing. Business Intelligence 2.0: integrating new
sources on traditional BI Data Warehouse.]

The next steps of evolution for Business Intelligence tools point in two main
directions:

• Business Process Management Systems, with the support of BPM notions
and real-time (or better, right-time) warehousing

• Business Intelligence 2.0, where data sources generated by the new
environment of Web 2.0 are recognized as an essential part of
Management Information

The further evolution, as a combination of those two major trends, is
discussed in Section 2.4.

2.2.3 Value and Challenges
Every business operates in an environment with many variables and dimensions. A
managerial decision, at whatever level it is taken (strategic or operational), is a
response to a stimulus from its environment (internal or external). In any case, it
is of the greatest importance for any decision-maker to be properly informed. The
main objective of the Business Intelligence field is to process information from
all sources, integrate it and visualize it in a comprehensive and user-friendly way
that will help the decision-maker take the right decision. The output of BI
tools can be:

• pattern discovery

• cause-effect cases

• statistical analysis

• what-if scenarios

• mind-maps

and, generally, outputs that a person needs in order to take a decision but that
demand a lot of time, effort and insight if produced by a human. At its most evolved
stage, Business Intelligence should create an environment in which uncertainty is
minimized and even inexperienced managers are able to take the right
decisions, by effectively transforming raw data into decisions, following the chain
pictured below.


Figure 2.2: The role of BI Systems in Decision Making, (Olszak and Ziemba, 2007)

As expressed in (72), the “Value of BI for business is predominantly expressed in the
fact that such systems cast some light on information that may serve as the basis for
carrying out fundamental changes in a particular enterprise, i.e. establishing new
cooperation, acquiring new customers, creating new markets, offering products to
customers”.

According to Gartner (2002), the main purposes BI serves at the strategic level are:

• Corporate performance management

• Optimizing customer relations

• Monitoring business activity and traditional decision support

• Managing specific operations or strategies

• Management reporting of Business Intelligence

(79) notes the great importance of knowledge for operating effectively and
recognizes four types of knowledge:

• Procedural knowledge, which explains how a task is done

• Declarative knowledge, which explains what has to be done

• Semantic knowledge, through which implicit relations between objects are
made

• Casuistic knowledge, which refers to past cases


As it further points out, “organisations that are interested to use knowledge in
decision-making are forced to work out procedures that enable them to transform
tacit knowledge into explicit knowledge. In this situation, organisations find it
necessary to create repositories of knowledge and knowledge management systems,
simultaneously finding the way to match them with decision support systems”.

The success of a Business Intelligence implementation, as perceived by users, is not
easy to establish. The main reasons are:

• the difficulty of measuring the mainly intangible effects of Business
Intelligence tools

• the time lag between the moment BI starts affecting the decision process and the
moment this becomes visible in the company’s results.

Thus, such an implementation should be backed by managerial support and a culture
oriented towards knowledge and information. Business Intelligence changes involve
a general cultural change and have to be led by corporate vision. As heard at a
Harvard Business School symposium titled “Competing on Analytics – How fact-
based Decisions and Business Intelligence Drive Performance”, the ability to use BI
for competitive advantage “starts with CEO’s commitment and involves building the
necessary enterprise-wide infrastructure, analytical skills and culture. When done
successfully, competing on analytics creates value and strategic advantage”.

In (57), the main success factors of a BI implementation are described:

• Senior Management Support

• Use of analytics is part of organization’s culture

• Alignment between Business and BI Strategies

• Effective BI Governance

• Existence of strong decision support data infrastructure

• User training and support

In (64), further factors are introduced:

• Managerial support

• Established BI strategy aligned with corporate strategy

• User acceptance and training

• Tangible results

• Ability to draw conclusions

From both lists, it is clear how important managerial support, alignment with the
company’s culture and user acceptance are.
As described in (72), “..Decision makers and organisations should predominantly
associate BI with organisational implementation of specific philosophy and
methodology that would refer to working with information and knowledge, open
communication, knowledge sharing along with the holistic and analytic approach to
business processes in organisations..”

(72) also notes the high importance of users’ involvement throughout the whole lifecycle
of a BI implementation. This seems essential for most application development on
the premises of a company, as the final output will serve users directly or indirectly.
Gartner (54) notes that BI applications should become part of users’ workflow.
Otherwise, even if an application meets its requirements, users will not adopt it, as
they will still have to cope with their main work duties.

Users’ involvement should start from the first step of requirements analysis and
should include:

• identifying and modelling knowledge;

• monitoring and modifying data repositories;

• creating their own analyses and reports;

• learning how to interpret results and ask sophisticated questions; and

• improving business and decision making on an ongoing basis.

To prove the effectiveness of BI and improve its performance, objective
measurement methods have been proposed. Measuring BI is not an easy task,
but it is valuable. “It may be suggested that BI has no value itself. What bears the
business value is the decision that finally took place. The time lag between the


succession of BI objectives (which can be intangible, like quality improvement) and
their translation into financial outcomes can be big, and maybe not even directly
linked.” (71)

Davison (2001) has developed a measurement model called CIMM (Competitive
Intelligence Measurement Model) which returns the ROCII (ROI on CI). He tries to
absorb qualitative factors into ROI (e.g. decision-maker satisfaction). It is an
initial effort, however, and has proven unreliable.

Herring (1996) identifies four indicators for measuring BI success (time savings, cost
savings, cost avoidance and revenue enhancements) without, however, explicitly
defining a means of measurement. Sawka (2000) proposes applying measurements
according to the contribution of BI to specific decisions. Davison (2001) proposed a
solution based on the perception of users, who are asked specific questions about the
success of a BI project. However, this can also be monetized.

Hoadley (2004) introduced the “Hoadley Suite”, which measures the effectiveness of a BI
system based on the completeness of the data collected and their timeliness. In (71), a
method of balanced performance measurement is proposed (based on the balanced
scorecard method).

Figure 2.3: BI performance measurement (71)


Having covered the basic notions that comprise Business Intelligence, its value to the
business and the challenges it faces, we proceed to the main elements of a
BI implementation.

2.3 STRUCTURE OF A BI-DW
A Business Intelligence system based on a Data Warehouse structure is the common
architecture today. In short, this is called BI-DW. The structure of such a system is
pictured below.

[Diagram labels. Outer layers: Semantic Layer, BI 2.0, BPI. Data Acquisition Tier:
Traditional Data Sources (ERP or other), Internal Collaborative Data, EAI. Data
Warehousing Tier: ETL Process, ODS, Metadata repository, Data Warehouse. Data
Analysis Tier: Statistical Analysis, Data Mining, OLAP. Data Distribution Tier:
Predictive Analysis, Trend Analysis, Data Cubes, Alerting, Dashboards.]

Figure 2.4: Traditional BI-DW structure

In the figure above, the structure of a basic BI-DW is pictured inside the blue box.
Outside that box are extra layers / evolutions that are shaping the progress of the BI


era today, namely the semantic layer, the BI 2.0 evolution and the BPI evolution. Those
are discussed in the next chapter.

A BI-DW can be divided into four basic tiers, traversed successively from the
acquisition of the data to its presentation to the user:

• Data Acquisition Tier, where all required data sources are identified and
crawled regularly in order to acquire updated information. This can be a pull
process, where the BI mechanism searches / crawls for information, or a push
process, where data is fed into the system whenever information is available

• Data Warehousing Tier, where data from different sources is integrated and
stored in the data warehouse. A Business Data Warehouse is a unified view of the
enterprise, primarily for integrated reporting (Devlin & Murphy, 1988). The
existence of a DW ensures a single view of the truth for all
applications on the premises of the company

• Data Analysis Tier, where data analysis takes place in order to answer the
questions put to the system. The types of analysis vary and depend on the
type of question. A data analysis process can be called from any application
that operates on company premises and does not have to reside in a
central mechanism. Every application can have its own BI tools. However,
the data remains the same

• Data Distribution Tier, where data is presented to the user. This can be
view-only, with the user receiving data either on request or at specific time
intervals, or interactive, where users have controls over the data shown to them in
order to check alternative scenarios or give feedback to the system, e.g. mind-maps.

The components of each tier are described in more detail below:

2.3.1 Data Acquisition Tier
At this stage, data is collected from different sources. These can be external to the
company or internal (feedback from business processes). In any case, all sources that
contain candidate data should be identified so that they can be monitored.


Recently, the term EAI (Enterprise Application Integration) was introduced. Software
at this stage serves as middleware through which all the separate applications used on
the premises of a company are integrated. Thus, heterogeneous platforms are connected
through a single hub, without having to alter their structures or create links between
every pair of platforms.

A recent challenge is posed by the unstructured sources operating within
Web 2.0. These feeds are invaluable and have unique characteristics that make
monitoring them, at least partially, a prerequisite for effective BI. More details on
that are discussed in the next chapter.

2.3.2 Data Warehousing Tier
As noted in (79), “Utility of data warehouses largely depends on the quality of their
data stored”. A basic component at this level is the set of ETL tools. The acronym stands
for extract-transform-load and it covers all the processes responsible for those
three tasks. Extraction involves the tasks described in the previous section and is about
obtaining access to data originating from all candidate sources. Information such as
extraction time, structure of source data etc. is also logged.

The data is then transformed in a series of actions that are thought to be the most
complex stage of the ETL process. As (79) describes “the process is usually performed
by means of traditional programming languages, script languages or the SQL
language. Data transformation means data unification, calculation of necessary
aggregates, identification of missing data or duplication of data”.

Data loading is the last stage, in which the data warehouse is updated with the data
processed in the previous stages. What matters here is the speed of performing that
task. Again, as (79) notes: “since the process in question frequently involves
switching the system into the offline mode, it is particularly important to minimise the
time that is necessary to transfer data”.
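
The three ETL stages described above can be sketched in a few lines. This is only an illustrative sketch, not the tooling discussed in (79): the source rows, the table name `sales` and the helper names `extract`, `transform`, `load` and `run_etl` are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical source rows: (customer, region, amount). None marks missing data.
SOURCE_ROWS = [
    ("Acme", "north", 120.0),
    ("acme ", "North", 120.0),   # duplicate once names and regions are unified
    ("Binio", "south", None),    # missing amount
    ("Corex", "South", 80.0),
]

def extract(rows):
    """Extract: read the source and log extraction metadata."""
    log = {"extracted_at": datetime.now(timezone.utc).isoformat(),
           "row_count": len(rows)}
    return list(rows), log

def transform(rows):
    """Transform: unify formats, drop duplicates, set aside rows with missing data."""
    seen, clean, missing = set(), [], []
    for customer, region, amount in rows:
        record = (customer.strip().title(), region.strip().lower(), amount)
        if amount is None:
            missing.append(record)
            continue
        if record in seen:
            continue
        seen.add(record)
        clean.append(record)
    return clean, missing

def load(conn, rows):
    """Load: write the cleaned rows in a single transaction."""
    with conn:
        conn.execute("CREATE TABLE IF NOT EXISTS sales "
                     "(customer TEXT, region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

def run_etl():
    conn = sqlite3.connect(":memory:")   # stands in for the data warehouse
    raw, log = extract(SOURCE_ROWS)
    clean, missing = transform(raw)
    load(conn, clean)
    # Aggregate computed on the loaded data, one total per region.
    totals = dict(conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"))
    return log, clean, missing, totals
```

Loading inside one transaction is a simple way to keep the update window short, in the spirit of the remark on offline time; a production loader would typically use bulk-load facilities instead.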

Depending on the scope of realised functions ETL tools may be divided into four
categories (Meyer, 2001):

• EtL tools that give more attention to extraction and loading tasks


• eTL or ETl tools that prefer specific types of input or output data (e.g. they function
exclusively on text files or specific database formats), and offer reliable and
fast functions of data processing and transforming

• eTl tools that realise the process of data transformation relatively well,
although they do not process some data formats effectively

• ETL tools that are complete integration environments

In some implementations, an ODS (Operational Data Store) sits in front of the Data
Warehouse. The main reason for such an implementation is to separate the
physical space where integration takes place from the one used for
analysis, in order to lighten the load on the latter, whose success and speed are
critical for the business.

Moreover, in some implementations, Data Marts exist. These structures are small
data warehouses which serve the needs of specific applications. Again, this is
done in order to balance the load and handle effectively different processes taking
place at the same time.
In (69), Operational Data Stores and Data Warehouses are regarded as a supporting
middleware layer between the transactional applications and the decision-support
module.

Finally, metadata repositories are also essential for a DW implementation. They
hold information about the data itself. They facilitate the process of extracting,
transforming and loading data, and they offer alternative solutions at the
summarization level (79).

2.3.3 Data Analysis Tier
Data Mining is “the process of identifying and interpreting patterns in data to solve
a specific business problem” (64). Data mining looks for patterns and
relationships in data without knowing the question in advance, i.e. without searching
for something specific. As described in (64), the steps of Data Mining are:

• Locate Business Issue. A generic question or group of questions is set as the
object of the data mining task. This has to be consistent with the data owned


• Data model development. At this stage data and metadata are mapped
to the Business Issue that was previously identified

• Data pre-processing in the form of data cleansing

• Mining Technique Selection. There are two main categories: Discovery
Mining, whose main objective is to find patterns without prior knowledge of the
pattern (clustering, sequence etc.), and Predictive Mining, which finds
relationships between a specific variable, called the target variable, and other
variables (classification, regression etc.)

• Display results

• Respond to mining results

OLAP (Online Analytical Processing) revolutionized data processing and the
analysis of data from multiple perspectives. In this structure, the data schema has
more than the usual two dimensions taking part in an analysis. As a result, the user
can obtain results while varying more than two parameters, e.g. a sales analysis can be
done using timeline, geographic region and product group simultaneously.
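
The sales example above can be illustrated with a toy cube. The sketch below is an assumption-laden illustration rather than an OLAP engine: the fact rows, the dimension names and the helpers `roll_up` and `slice_cube` are hypothetical, and real OLAP servers pre-compute and index such aggregates.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, region, product_group, sales)
FACTS = [
    ("2010-Q1", "EU", "hotels", 100),
    ("2010-Q1", "EU", "flights", 50),
    ("2010-Q1", "US", "hotels", 70),
    ("2010-Q2", "EU", "hotels", 120),
    ("2010-Q2", "US", "flights", 90),
]
DIMENSIONS = ("quarter", "region", "product_group")

def roll_up(facts, keep):
    """Aggregate sales over the dimensions named in `keep`, summing out the rest."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def slice_cube(facts, **fixed):
    """Keep only the facts matching fixed dimension values, e.g. region='EU'."""
    return [r for r in facts
            if all(r[DIMENSIONS.index(d)] == v for d, v in fixed.items())]
```

For instance, `roll_up(FACTS, ("quarter", "region"))` views sales by time and region at once, while `roll_up(slice_cube(FACTS, region="EU"), ("quarter",))` fixes one dimension and analyses along another.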

2.3.4 Data Distribution Tier
The objective at this stage is to offer the user a comprehensive view of the data. This
operation is extremely critical, as data that cannot be used effectively adds no
business value, whatever the quality of the analysis. That is why so much research has
been done on it.

In (68), four types of visual display format for BI tools are identified:

• hierarchical displays, which show results in lists. They seem to be an
effective access tool, particularly for browsing

• network displays

• scatter displays, which are effective at revealing data patterns

• map displays, which provide a view of all items at an upper level. They are
effective at showing a lot of data in a single view.

Other specific visualization tools used are:


• Alerting, which applies an exception-management procedure: users do not have
to scan reports to find situations that need attention but are alerted only to
those situations in order to take appropriate action.

• Dashboards, which inform users about performance.

• Ad hoc queries, which give users the option to create their own queries.

This tier also includes the tools that the system provides to users in order to
interact with it, either for posing questions or for interacting with the
data that the system returned in answer to the initial question.

As for the first case (query posing), (68) recognizes two types of queries:

• Specific Query Formulation, where the user sets specific criteria and the
system shows a specific set of results that match those criteria. This is
the traditional query mechanism

• Broad Query Formulation, where the user poses a broad query and the system
shows a broad selection of results to scan.

Regarding user interaction with the results, a good example is that of mind-map
tools, where the user gives feedback to the system in a number of iterative cycles
until a desired result is reached. A mind-map tool example is discussed further in a
later chapter.

(65) discusses the ability a BI system should have to handle information not
only from the data warehouse but also from totally unstructured sources (a trend that is
discussed further in a later chapter). A system should be able to respond to users’
requests, expressed as keywords or query parameters, with unstructured data as well. To
tackle that challenge, a metadata repository is used in which objects of unstructured
data are tagged with keywords used globally across the DW, which also link them to
structured data.
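
A minimal sketch of such a keyword-based metadata repository follows. It is not the design of (65): the class `MetadataRepository`, the keyword vocabulary, the document identifiers and the `STRUCTURED` lookup are all hypothetical, and tagging is reduced to naive word matching.

```python
# Hypothetical keyword vocabulary shared with the data warehouse dimensions.
DW_KEYWORDS = {"athens", "hotel", "flight", "refund"}

# Hypothetical structured entries reachable through the same keywords.
STRUCTURED = {
    "hotel": [{"table": "bookings", "rows": 1240}],
    "refund": [{"table": "payments", "rows": 37}],
}

class MetadataRepository:
    """Tags unstructured documents with DW keywords and links both worlds."""

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)
        self.index = {}          # keyword -> set of document ids

    def tag(self, doc_id, text):
        """Attach every vocabulary keyword found in the text to the document."""
        words = {w.strip(".,!?").lower() for w in text.split()}
        tags = words & self.vocabulary
        for keyword in tags:
            self.index.setdefault(keyword, set()).add(doc_id)
        return tags

    def search(self, keyword):
        """Return unstructured and structured hits for one keyword."""
        return {
            "documents": sorted(self.index.get(keyword, set())),
            "structured": STRUCTURED.get(keyword, []),
        }
```

A single keyword query then returns both tagged documents (reviews, e-mails) and the structured entries filed under the same keyword, which is the linking role the paragraph above assigns to the repository.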

On top of that basic architecture, several ideas have been applied in recent years to
help companies operate more effectively in the new environment that is taking shape.
Those evolutions are discussed in the next section.


2.4 BUSINESS INTELLIGENCE: EVOLUTION NEXT
2.4.1 General Trends
Gartner (59) recognized 2011 as the first year in which trends in the Business
Intelligence area were led primarily “..from the need of users for easiness of use and
flexibility than the need of IT Department for control on data and standards..”.
Accordingly, it describes the strategies of the two major groups of BI software:
Traditional Enterprise BI Platforms and Data Discovery Platforms, which offer convenient
tools for information retrieval. The trend stated above would normally give a window of
opportunity to the second group, smaller vendors that offer tools for easy
information discovery, against the vendors of big platforms. However, big vendors
continue to hold the biggest market share thanks to their promises of tighter process
integration and vertical integration within their information infrastructure stacks. In
any case, what is described above is a turn towards user friendliness and ease of use
of the data.

In its 2011 report, Gartner also predicted that “interactive visualization, predictive
analytics, dashboards, and OLAP usage will increase” even though the largest
proportion of BI processes are ad hoc reports. The new interfaces will push
information from analysts to a larger portion of users. In short, BI will become
pervasive, spreading to a larger user base thanks to the availability of
easy-to-understand dashboards and Web-based platforms accessible from any place with an
Internet connection.

The major upcoming trends, as stated in the same report, were:

• Consumerization of BI. “BI Tools must be simple, mobile and “fun” in order
to expand use and value.”

• Extreme Data Performance Support

• Emerging Data sources

• BI embedded in the business process

• Collaborative Decision Making


Regarding the fourth trend (BI embedded in the business process), (60) notes the
change in businesses that also brings a wind of change to the Business Intelligence
area. Companies are more process-driven, linking activities throughout the company’s
workflow in order to control results. KPIs are set and watched in order to achieve
performance. Measurements are shared across the whole company, promoting
information democracy.
The figure below (taken from the Yahoo! Data Strategy Plan (56)) pictures
the evolution of BI from the initial state of plain transaction description to a
strategic tool. A big step for tomorrow is BAM (Business Activity Monitoring), which is
discussed further later.

Transactional Reporting: “give me my reports”
Data Warehousing - OLAP: “explore data for interesting patterns”
BPM (Business Performance Management): “align to Business Goals”
Guided Analytics / Business Activity Monitoring: “where I should look next”
Tactical Decisions: “what should I do right now”

Figure 2.5: Evolution of BI to strategic tool, Yahoo Data Strategy Team

2.4.2 The Green Revolution: Business Process Intelligence


A major trend in the Business Intelligence area is the positioning of Business Processes
as a key element of the whole architecture, interacting easily with right-time data (as
expressed in KPIs). The approach is now process-driven rather than data-driven, as a BI
tool may offer all the modules required to handle a process. We borrow the term
Business Process Intelligence from (Casati et al, 2003) in order to describe “a set of


integrated tools that supports business and IT users in managing process execution
quality” (84)

The notion of BPI opens a new era in which Business Intelligence really affects the
value chain of a company, interacting directly with the elements that really create
value for it: its processes. It is based on the theory of Closed-Loop Business
Intelligence and it embodies and integrates the notions of Business Performance
Management and Real-Time Data Warehousing. It creates an extra layer that
covers the operation of the whole BI engine, as described in Figure 2.4. Those
three basic notions are described further below.

2.4.2.1 Closed-loop BI
Colin White describes a structure of Closed-Loop Business Intelligence (53) in which, as
the BI system feeds Operations in their decision-making process, an opposite flow also
exists in which Operations feed BI with data for analysis. In the middle sit
collaborative applications that help users make decisions for Operations,
using data from BI analysis.

Closed-Loop Decision Making System, source: http://www.b-eye-network.com/view/10275

In this concept, an important stage is that of sharing information. One of its
important notions is the use of collaborative applications (rather than
traditional BI techniques) that enable users to make changes on the operational side.
According to (69), the presence of an Operational Data Store is critical for the
operation of a closed-loop business intelligence system, as it is through this store
that information is fed back to the system.

The operation of a Closed-Loop BI system is critical for achieving a real-time
analytics architecture since, through this cycle, actions decided on operational and


strategic level are passed quickly back to the system, reducing the time lapse
required.

2.4.2.2 Business Performance Management
Business Performance Management is a management theory that has become prevalent in
recent years. It is defined as “a set of processes that help organization optimize
business performance by encouraging process effectiveness as well as efficient use of
financial, human and material resources” (60). BPM can be considered a
process-optimization approach. A prerequisite for performing this type of management is
the ability to translate the strategic goals of the company to the day-to-day,
operational level. This is done through the implementation of KPIs (Key Performance
Indicators). Those KPIs should be fed “at the right time, at the proper decision level
and in the best form” (60).

A BPMS (Business Process Management System) is the package that results from
the convergence of BPM with information technology (and of course BI packages),
“used to automate processes and provide process monitoring and improvement
capabilities representing a revolutionary way of using technology in the business
environment” (83). A BPMS attempts to bridge the gap between
executing a process and measuring its performance. This can be done in an iterative
process in which the system gives the user feedback on the process’s performance after
each change he makes.

The basic components of a BPMS are described in (84):

• PDW Loader, which extracts data from all data sources, checks them and
integrates them into the PD Warehouse

• Process Data Warehouse, where data are stored for further analysis

• Process Mining Engine, which applies data mining techniques to data on the
PD Warehouse

• Cockpit, which is a graphical interface through which users interact
with the system. The output at this level can be:

o Low-level BI tools querying the PDW

o OLAP tools, analyzing over the mid and long term for strategic-level
decisions


o Dashboards, visualizing the day-to-day operations for decisions at the
operational level

o Alerts on all levels to spot unexpected events

o In (84), the addition of a semantic layer was proposed. Through its
established ontologies, the user is able to create his own scenarios
using the notions they define.

Those elements correspond to the four tiers of a BI-DW described in the
previous section.

Applying BPM to BI contrasts with traditional business monitoring
tools like Balanced Scorecards. (83) notes how the new approach and the
tools that follow it offer better insight into business processes, in contrast to
traditional BI techniques where there is “a latency between a business event and
monitoring the effect on business, and subsequently taking action. This problem
causes some actions to happen too late to prevent incidents.”
According to (60), BPM differs from a traditional DW-BI implementation basically in
the following:

• The decision-making process is transferred from the strategic level to users
at the operational and tactical levels, who have to deal with a subset of indicators

• The decisions at those levels must be faster, so information should be
refreshed regularly (fresh enough for the right decision to be taken). Lifetime of
data is coarse. Real-time warehousing plays an important role in that

• KPIs should be monitored through a user-friendly interface that
anyone should be able to use to analyze them. Dashboards, automated alerts and reports
with the specific indicators are examples of such an interface.

The software that allows the monitoring of business activities is also referred to as
BAM (Business Activity Monitoring). BAM is usually thought of as “the technology
module of BPMS being the real-time reporting, analysis and alerting of significant
business events, accomplished by gathering data, key performance indicators and
business events from multiple applications” (60).

The main components of BAM are (60):


• Right Time Integrator (RTI), which integrates at right time data from
operational databases, the DW, EAI and real-time streams

• Dynamic Data Store (DDS), a repository capable of storing short-term
data for fast retrieval

• KPI manager, which computes all the indicators needed at the different levels
to feed dashboards and reports

• a set of Mining Tools, capable of extracting relevant patterns from the data
streams

• a Rule Engine, which monitors KPIs in order to alert users to events
In conclusion, the use of a BPMS in a company’s operation establishes a process-centric
approach to BI implementation, which is in line with current views on adding value
along the value chain. The vision is to create an environment where the user can
quickly check the indicators that measure the performance of business processes
and thus manage them. As processes are the core of business operations,
handling them translates into creating business value. In this way the use of BI tools
is fully aligned with management objectives.

2.4.2.3 Real-time BI
Real-time Business Intelligence is the answer to users’ need for fast responses
with fresh data to ad hoc queries. It is enabled by Enterprise Information
Integration, Enterprise Application Integration and real-time data warehousing
technologies. As decisions at the strategic level are mainly long-term, this
technology refers primarily to tactical decisions. In (56), some examples of real-time
analytics are mentioned:

• Fraud detection. Detect anomalies on credit card usage

• Web Targeting. Display an ad or content based on demographic profile,
geolocation or behaviour

• Search Term Analytics

• Real Time Inventory Analysis


Operations like those mentioned above have no value if the data they
process is outdated. Traditional BI systems need time to collect, process
and integrate information before it is available for analysis. Once a
query is posted to the system, further time is required until it is processed
and the result is available for presentation to the user. This is captured by the time
lags theory below (62).

Figure 2.6: Gained Value reducing action time (Hackathorn 2002)

• Data Latency, which is the time required from the moment an event occurred
until the moment it was stored in the data warehouse

• Analysis latency, which is the time between its storage and the moment its
analysis was finished in order to become available to users

• Decision Latency, which is the time between its availability and the moment
an action took place

The first two require changes in technical aspects, the third in business
processes. Improvement in the first two alone does not bring any business
value. However, as noted in (61), real-time is not usually the requirement. Instead,
right-time is what is needed. For example, for credit card fraud the objective is not
an instantaneous response but a latency of a few seconds. For other
applications, right-time means a time window of minutes or even hours.
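
The three latencies and the right-time criterion can be made concrete with timestamps. The sketch below is only illustrative: the function names `latencies` and `is_right_time` and the idea of comparing the total against a per-application window are assumptions made for the example, not part of (61) or (62).

```python
from datetime import datetime

def latencies(event, stored, analysed, acted):
    """Split total action time into data, analysis and decision latency (seconds)."""
    return {
        "data": (stored - event).total_seconds(),        # event -> warehouse
        "analysis": (analysed - stored).total_seconds(),  # warehouse -> result ready
        "decision": (acted - analysed).total_seconds(),   # result ready -> action
        "total": (acted - event).total_seconds(),
    }

def is_right_time(parts, window_seconds):
    """Right-time: the whole chain fits the window the application needs."""
    return parts["total"] <= window_seconds
```

With an event at 12:00:00, storage at 12:00:02, analysis finished at 12:00:03 and action at 12:00:08, the decision latency (5 s) dominates the total of 8 s, which illustrates why improving the first two latencies alone may bring little business value.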


This is important to take into consideration, as the requirements for a real-time
engine are not small. As (56) notes, considerations include:

• Money. Reducing latency is not cost free. Specialized hardware and software
are required.

• The engine should deliver timely, normalized data according to each user’s
requirements (role-specific view of data)

In (70), four types of right-time BI processing are recognized:

• Right-time data integration, which aims to reduce data latency. The goal is to
produce a single view of data at the business-wide level. Three main
techniques are used:

o Data Consolidation, in which integration technologies pre-process
data from different sources into a unified form. The ETL
process is a tool of this type

o Data Federation, which provides a single view of one or more sources
when a user issues a query. Enterprise Information Integration
(EII) is such a technology. This method, even though not appropriate
for large amounts of data, removes the need for data consolidation in a
common data warehouse.

o Data Propagation, with which data is copied from one source to
another. This is a push operation, meaning that the process does not
wait for a request in order to begin. Enterprise Application
Integration (EAI) is such a technology.

• Operational BI reporting, which aims to reduce data and analysis latency. It
can be achieved by applying Data Federation techniques and technologies such as
those described above. Another tool here is the Operational Data
Store (ODS), which serves as an intermediate database that is queried ad hoc
without affecting the performance of the main application. However, this intermediate
stage should also stay as close to real time as possible, which adds to the processing
power required.

• Operational BI-Performance management, which tries to reduce analysis latency


• Decision automation with software agents, which tries to reduce decision latency.
An example is a BI alerting module, which can work at various
levels: just inform the user, propose a solution or even take action
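
The three right-time integration techniques listed above (consolidation, federation, propagation) differ mainly in when data moves. The toy sketch below contrasts them under strong simplifying assumptions: the `Source` class and the functions `consolidate` and `federate` are hypothetical, sources hold plain values, and no real EII or EAI product works this simply.

```python
class Source:
    """A hypothetical operational system holding rows and pushing changes."""

    def __init__(self, name, rows):
        self.name, self.rows, self.subscribers = name, list(rows), []

    def insert(self, row):
        self.rows.append(row)
        for target in self.subscribers:      # propagation: push, no request needed
            target.append((self.name, row))

def consolidate(sources):
    """Consolidation: copy every source into one unified store ahead of time."""
    return [(s.name, row) for s in sources for row in s.rows]

def federate(sources, predicate):
    """Federation: build a single view at query time, without a stored copy."""
    return [(s.name, row) for s in sources for row in s.rows if predicate(row)]
```

A consolidated copy is only as fresh as its last run, a federated query always reads the current sources but pays the cost at query time, and propagation delivers each change to subscribers as soon as it happens.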

A famous example of a real-time analytics implementation is that of
Continental Airlines, which helped the company rise from last place in
customer satisfaction indexes to “Customers’ Favourite” (61). Continental Airlines
managed to generate over $500 million in revenue enhancements and cost savings
with an investment of $30 million in hardware, software and personnel (an ROI of over
1000%). The modules developed included:

• Real-time financial performance indicators. A flight’s economic figures were
known as soon as the “wheels were up”

• Flight dashboards, which helped operations identify issues in their flight
network. For example, flight punctuality was measured through real-time analysis of
arrival and departure times
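
The ROI figure quoted for Continental Airlines can be checked with simple arithmetic. The sketch assumes the usual definition of ROI (net gain divided by investment) and treats $500 million as the whole gain, although the text says "over $500 million", so the result is a lower bound; the function name `roi_percent` is hypothetical.

```python
def roi_percent(gain, investment):
    """ROI as net gain over investment, expressed as a percentage."""
    return (gain - investment) / investment * 100

# Figures quoted in the text, in millions of dollars.
continental_roi = roi_percent(gain=500, investment=30)
```

Under these assumptions the result is roughly 1567%, which is consistent with the "over 1000%" stated above.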

In order to balance the load on hardware, they divided database queries into two
groups: tactical queries, which must provide real-time data and are set as high
priority in the query engine, and strategic queries.
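
The split into tactical and strategic queries can be illustrated with a small priority queue. This is a sketch of the general idea only, not Continental's actual query engine: the class `QueryScheduler`, the two priority constants and the sample query names are hypothetical.

```python
import heapq
from itertools import count

TACTICAL, STRATEGIC = 0, 1       # lower number is served first

class QueryScheduler:
    """Serves tactical (real-time) queries before strategic ones, FIFO within a class."""

    def __init__(self):
        self._heap = []
        self._order = count()    # tie-breaker that preserves arrival order

    def submit(self, query, priority):
        heapq.heappush(self._heap, (priority, next(self._order), query))

    def next_query(self):
        """Return the most urgent pending query, or None when the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Strategic queries still run, but only when no tactical query is waiting, which is one simple way to protect real-time workloads on shared hardware.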

Another aspect of a real-time system is the ability to transfer users’ actions quickly
into the system. (66) stresses the real-time action of an RTBI package, meaning the
option for the user to intervene instantly in business operations in order to calibrate
them according to BI results. A main component of that scenario is the process
dashboard, through which the user drives and changes processes. Enterprise Application
Integration (EAI) suites offer a solution for integrating heterogeneous
applications (different operating systems, different databases, different backbone
languages) on a common platform.

The alternatives for a real-time analytics engine are:

• Custom Solutions. They are optimized for the specific needs, their initial
development cost is low and they can adapt to meet changing business needs. On the
other hand, they lack integration with contextual data (amongst others)

• EDW (Enterprise Data Warehouses). Consolidate data marts into a central
data warehouse. No multi-database joins. Single source of truth. On the other


hand, many EDWs fail for organizational reasons (departments lose control of their
data and agility). An EDW is not real-time and is costly (up to $50 million for a large
organization)

• Virtual EDW. Provides virtual views of enterprise data. Each view is
optimized for a specific need or workload. Departments retain some control over their
data.

• Streaming, Business Activity Monitoring, Operational BI. Optimized for low
latency (few disk accesses). But data inconsistency , requires very high

• Increase frequency of ETL Operations

The combination of Real-Time Data Warehousing with BPMS creates an environment
in which the user interacts directly with the system, based on the business processes.
The objective is to affect the value chain effectively and instantly in order to
increase the output business value.

2.4.3 The Purple Evolution: Business Intelligence 2.0
2.4.3.1 Environment 2.0
The Web has transformed over the last decade in an extraordinary way, as almost every Internet commentator has noted. In Web 1.0 there was a clear distinction between two worlds: the Net, which contained information, and the Users, who read it. The editors of that content were the persons who ran their own web pages.

With Web 2.0, the tap opened. All users became content creators and data production multiplied. This fact, together with the development of faster networks and the emergence of Social Web applications, amongst others, has led more and more people to establish the Internet as their primary means of connecting with the community (communicating, getting informed etc.). Web Environment 2.0 is a “living organism” with billions of nodes interacting with each other. Data is produced and evolves at a great pace, in real time.


Figure 2.7: How the Web transformed into its 2.0 state. Source: Dion Hinchcliffe (2006-04-02). "The State of Web 2.0", Web Services Journal

The new role of those services in society was clearly visible in the major political and social events of recent years (Georgia, Tunisia, Egypt, Libya). Twitter was treated as a primary news source, even by big media. Its real-time engine delivered news much more quickly than traditional media, even though not as reliably.

Web 3.0, or whatever the next evolution of the Internet will be called, will multiply the nodes of the network even further and affect the data produced. The Internet of Things, Augmented Reality and the Semantic Web are already extending it in all existing and new dimensions.
In (51), the position of Business Intelligence in the context of the present Web Environment 2.0 is established. The commentator recognizes that, in parallel with the Web 2.0 evolution, BI technologies have changed mainly on the output side: HTML and PDF were used initially, followed by whole new platforms like Web portals and mobile phones. However, in a world where Twitter grew 2500% in 2009 (52), BI has not absorbed the changes. As described in (51), BI 2.0 is the combination of:

• Proactive alerts and notifications

• Event driven/ real time/ instant access to information

• Advanced analytics

• Enterprise Integration


• Mashups and portal integration

• Mobile/ Ubiquitous access

• Improved visualization, Rich Interfaces (RIA)

• BI as a service (SOA and SaaS)

• In-memory analytics

• Open Source BI

Continuing, the same commentator identifies the main aspects of Web 2.0 services and why they affect Business Intelligence. Below follows a list of 2.0 services and facts about the way they create, handle or visualize data. Those facts may mark a new epoch for Corporate Business Intelligence as well.


Create New Data: Content creation through a constant user involvement environment

Create New Links for Existing Objects: Things I Like

Create New Data: Platform of real-time information

Create New Links for Existing Objects: Any user can link information with relevant issues

Visualize Data: Challenges to visualize real-time, abundant data

Create and Enrich Data: Emergence of Collective Intelligence

Create Data: Easy uploading of user-created content

Enrich Data: Rating of uploaded data creates a sentiment of its “social vibe”

Enrich Data: Tag, comment on existing content. Collective Intelligence through people bringing up information that they think is valuable

Visualize Data: Metrics regarding article popularity are visible

Create New Links for Existing Objects: Nodes (users) are creating new specific-dimension connections (professional)

Figure 2.8: How Web 2.0 services create, use and affect data handling

Corporate Business Intelligence 2.0

In (51), a corporate environment is described where:

• Decisions are based on information created through “crowdsourcing”. A constantly evolving environment (internal and external) will feed the Business Intelligence engine.

• Exception-based reporting will be offered. Notifications will alert the user to events that need his attention, without him having to scan through a large mass of data.

• Visual display of data will evolve and be used broadly to summarize information more effectively than ever.

• Linkage with unstructured content will enrich collective knowledge.

In that environment, a way should be found to take advantage of the new capabilities. In (78), the term Web Business Intelligence was introduced to define a BI system that draws data automatically and in a meaningful way from the constantly updated Web. The modules of such a system are:

• Content Acquisition component, which is responsible for fetching, normalizing and integrating content from various sources, internal or external.

• Knowledge Creation component, where data mining techniques are used in order to create meaningful results and present them to the user.

• Profile Database, from which both modules draw data.

Content Acquisition is divided into two subtasks:

• Information Retrieval, where potentially relevant sources are retrieved. Relevant methods include:

o Manual retrieval, with high precision rates and low recall rates.

o Use of a crawler, either general, where it simply passes through pre-designated URLs searching for new content without any other criteria, or constrained, where additional criteria are applied.

o Query engines, where a query is placed on an index in order for results to be returned.

• Information Extraction, where information is extracted from the gathered content. The task becomes harder when the data is unstructured or semi-structured. On HTML pages, the tag structure helps identify the content that has value, either through the construction of a wrapper or with machine-learning techniques.
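
As an illustration of how the tag structure can drive Information Extraction, the following minimal sketch builds a hand-written wrapper with Python's standard `html.parser` module. It is not the system of (78): the class name `ClassTextWrapper`, the helper `extract` and the idea of selecting content by a CSS class are assumptions made for the example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change nesting depth.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextWrapper(HTMLParser):
    """Hypothetical wrapper: collects the text inside elements with a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # > 0 while we are inside a matching element
        self.chunks = []  # one text chunk per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.chunks.append("")

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data


def extract(html, target_class):
    parser = ClassTextWrapper(target_class)
    parser.feed(html)
    parser.close()
    # Normalize whitespace inside each extracted chunk.
    return [" ".join(chunk.split()) for chunk in parser.chunks]
```

Such a wrapper has to be rewritten whenever the page layout changes, which is one reason machine-learning techniques are mentioned as the alternative.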


Highly correlated with the concepts already discussed is the management of the knowledge created and the processes by which structured and unstructured data are handled in order to present them in a way that enables action to be taken. The categories are:

• OLAP / Reporting. Refers to structured data and can be descriptive or predictive.

• Pattern Discovery.

• Relevance Ranking, which refers to the problem of finding the most relevant sources among a large number of candidates.

In (80), researchers describe a solution in which key terms, fed by analysis on the DW, are used to query the Web in order to enhance knowledge. Search engines are queried with the keywords, and the resulting Web pages are parsed and indexed. Co-occurrence analysis then takes place in order to identify groups of terms called “Web Communities”. The results are visualized in a map environment.
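
A minimal sketch of the co-occurrence step, under the simplifying assumption that two terms co-occur when they appear on the same page. The function name `cooccurrence` and the whitespace tokenization are illustrative choices and are not taken from (80).

```python
from collections import Counter
from itertools import combinations


def cooccurrence(pages, terms):
    """Count, for every pair of terms, the number of pages that mention both."""
    counts = Counter()
    for text in pages:
        words = set(text.lower().split())
        present = sorted(t for t in terms if t in words)
        counts.update(combinations(present, 2))
    return counts
```

Pairs with high counts are candidates for belonging to the same group of terms; how those groups are actually formed and drawn on a map is outside the scope of this sketch.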

The challenges of discovering and integrating unstructured data into the Data Warehouse need special handling. (67) proposes visualization methods based on hierarchical and map displays as more effective for accessing and browsing information. The paper uses the term competitive intelligence tools to describe those that aim at systematically collecting and analyzing information from the competitive environment to assist organizational decision making. The data source for those tools is primarily the Web. That category stands in contrast to tools that use massive amounts of data stored in data warehouses in order to extract essential business information from them. Reviewing those tools, the authors note that they provide different views of the collected information but no further analysis.

Other common methods for accessing web content are crawlers, which materialize web sites locally by following links (such as Nutch-Hadoop), and web query languages, such as YQL.

Finally, Web data can also be assessed from different views. For example, the back links of a company's site can be used as a sign of the company's social communities.


2.4.4 Semantic Layer
In (82), the NLP (Natural Language Processing) and IE (Information Extraction) techniques used on unstructured data from the Web are included under the term Web-ETL. Those data then take part in the decision-making process. For example, IE techniques are used in a financial risk assessment module, specifically on company profiles. If a company comes from Russia and the country's Fitch rating has fallen, then the system will respond by raising the risk for that specific company.

In some efforts, Semantic Theory is used in the analysis of unstructured data. By creating Ontologies, researchers build high-level lexicons for specific domains, which they can later use in order to understand the meaning of a text or to create new links between existing objects in a database.
In the example described in (75), researchers introduce semantic layers in their approach, used in the telecommunications industry, in order to:

• Integrate data from various heterogeneous sources, in order to provide a unified view to users

• Give the opportunity for new data sources, not yet analysed, to be integrated into the BI process

We should note here the issue of ontology population, where the same object has different names in different ontology definitions.

Besides its use in the Web-ETL stage, a semantic layer can also be used as an extra layer on the Data Warehouse tier, providing users with new capabilities. In a traditional relational database the main approaches to data schemas are:

• Star schema (Kimball, 1998). According to it, the description of a business is stored in dimension tables and its measures are kept in fact tables.

• Snowflake schema, which is a normalized version of the star schema.

In this case, the main operations supported by tools based on these approaches (e.g. OLAP) are:

• Slice, which reduces the dimensionality of a cube,

• Dice, which selects a set of data,

• Drill-up, zooming out one level in the data representation hierarchy,

• Drill-down, zooming in one level in the data representation hierarchy,

• Drill-across, moving from one cube to another.
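
The operations listed above can be illustrated on a toy fact table held in plain Python. This is only a sketch: the sample `facts`, the function names `slice_cube`, `dice_cube` and `roll_up`, and the city-to-country hierarchy are assumptions for the example and do not correspond to any specific OLAP product.

```python
from collections import defaultdict

# Toy fact table: dimensions (year, city, country) and one measure (sales).
facts = [
    {"year": 2010, "city": "Athens", "country": "Greece", "sales": 10},
    {"year": 2010, "city": "Patras", "country": "Greece", "sales": 4},
    {"year": 2011, "city": "Athens", "country": "Greece", "sales": 7},
    {"year": 2011, "city": "Rome", "country": "Italy", "sales": 5},
]


def slice_cube(rows, dim, value):
    """Slice: fix one dimension to a single value and drop it from the result."""
    return [{k: v for k, v in r.items() if k != dim} for r in rows if r[dim] == value]


def dice_cube(rows, **allowed):
    """Dice: keep the rows whose dimensions fall inside the allowed value sets."""
    return [r for r in rows if all(r[d] in vals for d, vals in allowed.items())]


def roll_up(rows, dims, measure="sales"):
    """Drill-up: aggregate the measure over the given (coarser) dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dims)] += r[measure]
    return dict(totals)
```

Drill-down is the inverse navigation: calling `roll_up(facts, ["country", "city"])` after `roll_up(facts, ["country"])` moves from the country level to the finer city level.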


As an example of a semantic implementation we can study (73). There, researchers added a semantic layer to the Business Intelligence mechanism. BI aspects are conceptualized, giving analysts the ability to describe the data using more dimensions. The basic components of their system are: Ontologies, which describe the domain; Goals, which describe the objectives that users would like to achieve; Web Services, which are the means to achieve goals; and Mediators, which link two components together.

Three different Ontologies are used:

• BI Domain Ontology, which describes the Business Intelligence domain. This description includes “dimensions, measures, filters, privileges and parameters”.

• Application Domain Ontology, which describes specific business concepts.

• Service Ontologies, which describe Semantic Web Services.

Applying such a layer to Business Intelligence analysis allows the automatic identification of the concepts a user employs while defining an analysis, so that the system can make recommendations for the search. Furthermore, search results are expanded along dimensions that relate to the semantic relations between objects. In the specific paper, when a user searches for “university”, results that have to do with students can also be shown, as they are semantically related.
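
The semantic expansion of a search can be sketched with a tiny hand-made relation table. This is a deliberately simplified stand-in for an ontology: the dictionary `RELATED` and the functions `expand_query` and `search` are hypothetical and do not reproduce the mechanism of (73).

```python
# Assumed, hand-written semantic relations between concepts.
RELATED = {
    "university": {"student", "professor"},
    "student": {"university"},
}


def expand_query(term, related=RELATED):
    """Return the term together with its semantically related concepts."""
    return {term} | related.get(term, set())


def search(term, documents):
    """Return the documents mentioning the term or any related concept."""
    concepts = expand_query(term)
    return [d for d in documents if concepts & set(d.lower().split())]
```

A search for "university" then also returns documents that only mention "student", which mirrors the example given in the text.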

The concepts around analyzing unstructured data from the Web are further discussed in Chapter 3.

2.5 SPECIAL ISSUES
2.5.1.1 Cognitive Map Development: Weak Signs Management through PUZZLE System
Decision-making at the strategic level is based not only on data about past facts but also on sparse chunks of information that come from the environment of the company. In (63), researchers note that a primary task for a Business Intelligence system is to make sense of those sparse data in order to take advantage of opportunities and avoid threats. Those sparse data are called “weak signs” and are difficult to interpret due to their characteristics (El Sawy, 1985 and Rouibah, 1997). Business Intelligence is a process defined by five phases of a cycle:

• Targeting business environment in scope of BI

• Organising tracking

• Routing weak signs into the organisation

• Interpreting those signs into interpretable information

• Taking action

In (63), proactive weak signs (instead of reactive ones) are studied. Those signs have no direct correlation with any business process but can be the source of a decision for strategic change, which makes their analysis even harder.

The interpretation of unstructured weak signs is based on an “actor/theme/information” model for each chunk of information; each data object is treated as an instance of this generic model. Moreover, a main task is to associate existing objects through new connections. As the data coming from different sources can have different or even opposite meanings, different link types are implemented. The typology of links includes causality, confirmation and contradiction. The information collected and processed is used to develop “puzzles”: visual maps with links between weak signs. “Puzzles” are modified constantly as new signs are imported into the system. The system developed uses feedback from users as an input for restructuring links.
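
The “actor/theme/information” model and the three link types can be written down as a small data structure. The sketch below is an assumption about how such a puzzle might be represented; the class names `WeakSign` and `Puzzle` and their methods are hypothetical and are not taken from the PUZZLE system of (63).

```python
from dataclasses import dataclass, field

# The three link types named in the text.
LINK_TYPES = {"causality", "confirmation", "contradiction"}


@dataclass(frozen=True)
class WeakSign:
    actor: str
    theme: str
    information: str


@dataclass
class Puzzle:
    signs: list = field(default_factory=list)
    links: list = field(default_factory=list)  # (source, target, link_type) triples

    def add_sign(self, sign):
        self.signs.append(sign)

    def link(self, source, target, link_type):
        if link_type not in LINK_TYPES:
            raise ValueError(f"unknown link type: {link_type}")
        self.links.append((source, target, link_type))

    def signs_for(self, actor):
        """Return all weak signs recorded for one actor."""
        return [s for s in self.signs if s.actor == actor]
```

User feedback, as mentioned above, would then amount to adding, retyping or removing entries in `links`.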

The final output for the user is an environment in which he queries the events for a specific actor. He can then create links between events in order to produce cognitive maps (puzzles). Following this process iteratively, he is able to observe the landscape from different angles and finally reach conclusions about the links between objects.
In (77), the proposed approach aims at the situational awareness of the user, defined as the sum of information perception, comprehension and projection. This process is called Situation Assessment and can be supported by technology. The other notion that affects the cognitive process of a human is mental maps, the set of rules and assumptions that a person uses in order to make a decision. They act as a reasoning mechanism and affect Situation Assessment. According to the above, the decision-making process concerning an issue is the evaluation of different Situation Assessment sessions until a feasible solution is found. The judgement of each candidate solution is based on pattern matching and informal reasoning rather than analytical reasoning.

The application of cognitive theory to BI aims to help users handle the mass of information that traditional BI systems produce. The system in (77) simulates the human cognitive process by taking a human query as input and retrieving relevant business cases from its database in order to analyse it. The results of the analysis are shown to the user, who evaluates them in an iterative process that helps him enhance his own cognitive process.

2.5.2 Predictive Analytics
According to (48), “predictive analytics are used to determine the probable future outcome of an event or the likelihood of a situation occurring”. Predictive analytics views data from a different perspective than traditional BI: it searches for unknown patterns, series of data and, in general, events belonging to categories that the user would never have queried for, as he does not know they exist. The methods used include many tools, such as clustering, decision trees, text mining and others.

As (48) says, “the core element of predictive analytics is the predictor, a variable that can be measured for an individual or entity to predict future behaviour”.

According to (85), besides providing the right data, BI systems should also provide it “at the right time, at the right location, and in the right form to assist decision makers”. The authors introduce the notion of proactive BI, where the time frame from integrating the data until an action is passed into the system should be minimized. This notion is similar to, and supported by, the real-time data warehousing architecture discussed earlier.

Essential components of proactive BI are [Langseth and Vivatrat, 2003]:

• real-time data warehousing,
• data mining,
• automated anomaly and exception detection,
• proactive alerting with automatic recipient determination,
• seamless follow-through workflow,
• automatic learning and refinement,
• geographic information systems,
• data visualization

2.6 CONCLUSIONS – SITUATIONAL BUSINESS INTELLIGENCE
In (45), the term “Situational Business Intelligence” is introduced. The researchers describe a business environment with “a long tail of situational applications”. What they mean is that this environment is shaped not only by the critical applications that were traditionally monitored and analysed in order to provide information to Management (e.g. ERP, CRM), but also by a number of structured and unstructured sources, internal or external to the company. The characteristics of those sources make it difficult to assess them but also increasingly important to do so. Moreover, the value of information collected in an environment of Situational Business Intelligence decreases over time.
As noted in (45), “answering Situational Business Intelligence queries requires a close interaction between components for gathering text data, for extracting structured data from text, for cleansing extracted data, for obtaining a schema from the extracted data and for processing the extracted data on top of the generated schema.”

In that context, Business Process Management theory is integrated into BI tools in order to directly affect the core of a business: its value chain. The notion of collaborative decision making is introduced as a best practice supported by BI. Real-time architecture is implemented to support those operations. The exploitation of Web 2.0 data sources, which mainly produce unstructured data, is broadly explored. A semantic layer is added in many BI propositions in order to discover new links and offer new capabilities for analysing data. In general, BI technology is heading towards further integration with Business Operations and expanding the reach of its sources in order to become a truly value-adding tool for every level of the organizational pyramid.


Chapter 3 discusses the issue of analyzing unstructured data from the Web, as this will be a main challenge in a world where the Internet is taking the place of traditional means in the spread of news and opinions.


CHAPTER 3: OPINION MINING FROM RELEVANT WEB SERVICES


3.1 CURRENT SITUATION
The environment described in the previous chapter produces two basic categories of information: opinions and facts. Both categories consist mainly of unstructured information, expressed in a way that is not easily machine-recognizable. Facts were the main element of the Web in its initial state (the 90s and early 00s), when it was mainly a static source of information. Analyzing facts has been the theme of much research focusing on Web Mining techniques.

However, Web 2.0 has turned everyone into an editor. The production of information by users is abundant and constant. Its importance is very high, as more and more users choose this kind of stream in order to get informed. A blogger's opinion in a review of a new laptop can carry the same weight for potential customers as that of the editor of PC Magazine. A tweet commenting on a speech of the US President could have more effect than an article by the Reuters Chief Editor on the same issue.

The analysis of such information would give Decision Makers valuable, real-time, on-the-spot trends. It is thus the most important raw material for an environment of Situational Business Intelligence as described above. Techniques of this kind could serve a variety of purposes: businesses would be able to know and understand their customers' view of one of their products without having to conduct a marketing survey. An organization could measure the initial effect that a policy change or a decision had on its stakeholders and rapidly restructure its strategy.

In a few words, this is the era in which business entities, and not only they, have the most direct reach to their audience, whatever it is: customers, buyers, voters. Such an opportunity should not be left unexploited.

Opinion Mining is an essential part of the Business Intelligence weaponry of a business, and it can affect the business not only at the operational level but also at the strategic one. In (13), the researcher tried to use the theory to enhance strategic decision-making in a big Chinese public organization, using the PPP methodology, which is based on answers to open questions.

On the other hand, such information, if summarized and visualized effectively, could totally change the decision process of a buyer or customer. Previous reviews and opinions would be available and would create a framework for much better decision-making. This side would also be evolutionary: capitalism and perfect-market rules assume a consumer with total knowledge of the market. However, no tool is available for him to be fully informed in a globalized marketplace.

The effort to find an effective tool to address those problems is a difficult one, as the exploitation of such data would normally require a lot of human involvement and man-hours. However, the abundance of such valuable data and the current trends for the Web (a constantly growing base of active users with more involvement, and Web 3.0, which promises a global net of interconnected objects, the Internet of Things) make such a need much more intense. What we are essentially searching for is effective Web-ETL processes, transforming and integrating unstructured information into the same data warehouse.

3.2 OPINION MINING: THE TERM
3.2.1 Definition
An opinion differs from a fact in that it carries a very important emotional load. Analyzing this emotional dimension is the subject of the Opinion Mining field. Opinion Mining (alternatively, Sentiment Analysis) combines Natural Language Processing and Text Analytics techniques in order to address the problem of extracting qualitative attributes from a text. The results of such a task, on a given text, would answer questions like:

• Who is expressing the opinion?

• What are the attributes of the opinion? (negative, positive, neutral, strong,
weak)

• What does the opinion holder like, and what not?

The goal is the development of techniques and tools able to process large quantities of opinionated texts in order to answer the questions above and visualize the results in a way that is easily comprehensible for the end user.

Liu defined Opinion Mining as “..the task that aims to extract attributes and components of the objects that have been commented on in each document d ∈ D and to determine whether the comments are positive, negative or neutral, with D being a set of evaluative text documents that contain opinions (or sentiments) about an object..” (4)

An object $O$ is represented with a finite set of features, $F = \{f_1, f_2, \ldots, f_n\}$. Each feature $f_i$ in $F$ can be expressed with a finite set of words or phrases $W_i$, which are synonyms (4). A feature is a characteristic of a product or an aspect of the object (if the object is intangible).

[Diagram: an object O branches into features F1, F2, …, Fn; each feature branches into its synonyms S1(fi), S2(fi), …]

Figure 3.1: Feature Words – Objects relation
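
The object, feature and synonym relation of Figure 3.1 maps naturally onto a dictionary of sets. The sketch below is only an illustration: the sample object `camera`, its features and synonyms, and the helper `find_feature` are invented for the example.

```python
# Hypothetical object: each feature f_i maps to its synonym set W_i.
camera = {
    "picture quality": {"picture", "image", "photo quality"},
    "battery": {"battery", "battery life"},
}


def find_feature(phrase, obj_features):
    """Return the feature a word or phrase refers to, or None if it matches none."""
    phrase = phrase.lower()
    for feature, synonyms in obj_features.items():
        if phrase == feature or phrase in synonyms:
            return feature
    return None
```

With such a mapping, different wordings found in reviews ("image", "picture") are counted as comments on the same feature.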

Parsing techniques are a basic part of conducting the analysis. As defined in (5), parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given more or less formal grammar. Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. Assigning a syntactic and logical form to an input sentence:

• uses knowledge about word and word meanings (lexicon)

• uses a set of rules defining legal structures (grammar)

3.2.2 Early History
Early research was done from the late 70s in specific fields belonging to the family of Opinion Mining (e.g. Machine Learning, AI). From the mid 90s (with the early works of Wiebe, Hatzivassiloglou etc.) and especially from the 00s, Sentiment Analysis has attracted a lot of interest, with much research work done on it. The reasons were the development of the Web from a read-only state to a read-write one, with the emergence of Web 2.0 and the abundance of data that it created, as well as the rise of techniques in scientific fields like Natural Language Processing (4).

3.2.3 Opinion Mining Main Categorization
Approaches to Sentiment Analysis vary across research efforts. One common distinction is between Document Level and Sentence Level.

Sentiment classification at Document Level categorizes a document, normally on a two- or three-class scale. The possible classes would be positive, negative, and neutral (objective statement). The assumption used is that each document under review focuses on a single object and contains the opinion of a single opinion holder (11). The polarity of the document depends on the existence and frequency of opinionated and polarized words (e.g. good, bad etc.). The output of such a system is a classification of the document as a whole, without more detail.
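
A minimal sketch of document-level classification by counting polarized words. The tiny `POSITIVE` and `NEGATIVE` word lists and the function `classify_document` are assumptions for illustration only; the approaches cited in this chapter rely on much richer lexicons or on learned classifiers.

```python
# Hypothetical, deliberately tiny opinion lexicon.
POSITIVE = {"good", "great", "excellent", "clean"}
NEGATIVE = {"bad", "poor", "terrible", "dirty"}


def classify_document(text):
    """Label a whole document by the balance of positive and negative words."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The single label returned for the whole text shows the limitation noted above: nothing is said about which aspects were liked or disliked.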

On the other hand, Sentiment Classification at Sentence Level identifies the sentiment of individual sentences in the given document. Different themes or opinion holders can be identified in the same document. However, this method makes the assumption that only one opinion is contained in a sentence, which is not always true.

Finally, Sentiment Analysis at Attribute (or Feature) Level introduces a new stage in which features are identified in the text. Opinion words are then connected with specific features of the object. This is critical in order to create a clearer view of which elements the opinion holder liked and which he did not. In a pros-and-cons task, this kind of analysis is required.
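
One simple way to connect opinion words with features is to attach each opinion word to the nearest feature word in the same sentence. The sketch below uses that heuristic purely as an illustration; the word lists `OPINION` and `FEATURES` and the function `feature_sentiment` are hypothetical, and the nearest-word rule is an assumption rather than the method of any cited paper.

```python
# Hypothetical lexicons for a hotel-review domain.
OPINION = {"great": 1, "excellent": 1, "poor": -1, "noisy": -1}
FEATURES = {"room", "breakfast", "location"}


def feature_sentiment(text):
    """Sum opinion scores per feature, using the nearest feature in each sentence."""
    scores = {}
    for sentence in text.lower().replace("!", ".").split("."):
        tokens = [t.strip(",;:") for t in sentence.split()]
        feature_pos = [i for i, t in enumerate(tokens) if t in FEATURES]
        if not feature_pos:
            continue  # no feature mentioned: nothing to attach opinions to
        for i, token in enumerate(tokens):
            if token in OPINION:
                nearest = min(feature_pos, key=lambda p: abs(p - i))
                feature = tokens[nearest]
                scores[feature] = scores.get(feature, 0) + OPINION[token]
    return scores
```

The per-feature scores are exactly what a pros-and-cons summary needs: positive features on one side, negative ones on the other.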

Another kind of categorization splits methods depending on the number of categories produced. Pang and Lee, in (4), recognize Polarity Classification as the task that classifies an opinionated document as entirely negative or positive. This binary classification is contrasted with Rating Inference, which is a multi-class text categorization problem.


3.3 OPINION MINING: THE METHOD
3.3.1 Subtasks
The sentiment analysis task is divided into a number of subtasks. Even though various papers identify the required subtasks slightly differently, a common analysis includes the steps in the diagram below:

1. Corpus Collection

2. Corpus pre-processing

3. Lexicon Development

4. Subjectivity Classification

5. Identify Opinion Holder

6. Sentiment Classification

7. Visualization - Summarization

Figure 3.2: Indicative Steps on Opinion Mining Process

The tasks described above and their order are not a strict process. For example, identifying the opinion holder is not so important for blog posts commenting on products (as the holder is usually the one who posted the comment). On the other hand, on specific occasions Topic Identification is conducted or, when using Semantic Technologies, Ontology Development is crucial. However, those tasks form a common process covering the most important aspects of Opinion Mining methods.


3.3.2 Corpus Collection
The corpus to be analyzed is collected via a crawler that runs through designated URLs in order to collect pages of interest. As all the raw material is in HTML, a specific process takes place in order to isolate the parts that may contain opinion content. This task is called HTML Parsing. Such tools take advantage of common HTML syntax, which essentially consists of a collection of strings enclosed in tags. (32) used the tree-like structure of HTML pages, handling each part as a distinct leaf of the tree. They then used a wrapper agent, based on a supervised technique, which extracts all the words contained in the leaves of interest. This specific method refers to extracting data from structured text and not from unstructured text.

In (15), the issue of extracting comments from blogs is discussed. As the author's view alone is insufficient, the researchers search for a method to extract comments automatically from a blog post's page. The writers propose a “page-level” approach as opposed to the “site-level” approach. In the latter, there is a human cost in identifying patterns on each site and building new ones for every new site. To apply the former, they use a technique that combines a set of predefined rules with a supervised learning technique. First, the HTML page is parsed. After that, the HTML is processed as strings in order to find tags that could be the head of a repetitive pattern. Those tokens are further examined in order to find whether any rules can be formed. Those rules separate comments from non-comments. After that, an SVM is adopted to learn a comment/non-comment classifier.
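
The step of finding tags that could head a repetitive pattern can be sketched by counting how often each (tag, class) combination opens on a page. This is an assumed simplification, not the algorithm of (15): the names `TagSignatureCounter` and `repeated_signatures` and the threshold `min_repeats` are hypothetical, and the rule-forming and SVM stages are not shown.

```python
from collections import Counter
from html.parser import HTMLParser


class TagSignatureCounter(HTMLParser):
    """Counts opening tags by (tag name, class attribute) signature."""

    def __init__(self):
        super().__init__()
        self.signatures = Counter()

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.signatures[(tag, css_class)] += 1


def repeated_signatures(html, min_repeats=3):
    """Return the signatures that repeat often enough to suggest a comment list."""
    parser = TagSignatureCounter()
    parser.feed(html)
    parser.close()
    return {sig for sig, n in parser.signatures.items() if n >= min_repeats}
```

Because it only looks at the structure of the page being processed, this kind of heuristic needs no per-site configuration, which is the appeal of the page-level approach.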

3.3.3 Corpus pre-processing
After collecting the text corpus and prior to proceeding with its analysis, various pre-processing tasks may take place. These can be:

• Stemming. Words are reduced to their root. For example, the word “cats” will become “cat”. For that morphological analysis, stemming algorithms can be applied. There also exist parsing tools that process a text, like the Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml).


• Synonyms. After stemming, a further step may reduce even more the number of distinct words analyzed in the text. This can be done by grouping synonyms together and setting them to their “gloss root”, meaning a common word to which all the synonyms are mapped.

• Translation. Sentiment Analysis is strongly dependent on syntactic rules, and those that apply to the English language do not apply to other languages. Texts that are not in English may be translated first, so that the researcher can take advantage of the tools and methods that already exist for English.

• POS Tagging. Hatzivassiloglou and McKeown (2003) noticed that the presence of
adjectives in a text is a good indicator of text polarity. In later research on
movie reviews, Pang et al. (20) noticed that using only adjectives as polarity
indicators performed worse than also using nouns and verbs. In any case,
Part-Of-Speech Tagging is performed in order to exploit intuitions like the above.
Again, tools like the Stanford Parser can perform this task.

• In (6), researchers try to align ordinary opinions (which are, as they say, a
better and unbiased source of information) with expert opinions (which are
structured texts, but of less value). Using semi-supervised methods, they align
ordinary opinions to the “template” structure of the expert opinion.
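
The stemming and synonym steps above can be illustrated with a minimal Python sketch. It is not the pipeline of any cited work: the suffix rule is a crude stand-in for a real stemming algorithm, and the `GLOSS_ROOT` table and the function names are hypothetical, introduced only for illustration.

```python
import re

# Hypothetical synonym table: every synonym is mapped to one common "gloss root".
GLOSS_ROOT = {
    "superb": "excellent",
    "great": "excellent",
    "awful": "bad",
    "terrible": "bad",
}


def naive_stem(word):
    """Strip a plural "s"; a toy stand-in for a real stemming algorithm."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def preprocess(text, gloss_root=GLOSS_ROOT):
    """Lower-case, tokenize, stem, then replace synonyms by their gloss root."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(token) for token in tokens]
    return [gloss_root.get(stem, stem) for stem in stems]


print(preprocess("The cats were superb"))
```

A real system would replace `naive_stem` with an established stemmer and build the synonym table from a lexical resource rather than by hand.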

3.3.4 Lexicon Development
Opinion Mining relies on a Lexicon of terms. A lexicon is a collection of words that
are used as identifiers in order to define the attributes of a text. The terms of
interest are Feature Words (words used for the object or its features) and Opinion
Words (words that bear polarity in the specific domain). Opinion Words are further
divided into those that express the same feeling whatever the object is, like the
word “excellent” (domain-independent terms), and those whose meaning may differ
completely depending on the theme (domain-dependent terms). For example, “hot” may
have a negative meaning in the travel domain (hot weather) but a positive one in the
movies domain (hot movie). Even within the same domain and object, the same word may
have a different orientation for different features. For example, small size is
positive for a camera but small capacity is negative. The lexicon developed at this
level defines the prior polarity (Wilson et al.) of a word, meaning its polarity
outside of any particular context.

Creating an efficient lexicon is not an easy task. The abundance of words in each
language, as well as other challenges discussed later (synonyms, sarcasm etc.), makes
polarity definition hard even for humans. In (7), a training set was manually
collected and annotated by users as opinionated or not, together with its polarity
and its feature. Even with manual annotation, annotators disagreed in a considerable
percentage of cases about the polarity and the feature targeted in a sentence.

Techniques are divided into supervised and unsupervised. Supervised methods use
machine-learning techniques to “teach” a classifier to recognize words from a
training set. Unsupervised methods, on the other hand, mostly use syntactic and
grammatical rules, manually constructed lexicons or even statistical classifiers, but
without a learning step. In (31), researchers noticed that supervised methods were
more accurate than unsupervised ones, but rely heavily on the training set and need
time to train. They therefore considered unsupervised methods more appropriate for
real-time tools.

As noted in (23), there are four major approaches to developing linguistic resources:

• Conjunction Method,

• Pointwise Mutual Information (PMI) Method,

• WordNet Exploring Method,

• Gloss Classification Method

Conjunction method (Hatzivassiloglou and McKeown, 1997): The method is based on the
intuition that adjectives on the two sides of a positive conjunction (e.g. “and”)
will have the same polarity, whereas adjectives on the two sides of a word like “but”
will have opposite polarity. After the adjectives are extracted from the corpus, a
log-linear regression model is used to determine the polarity relation of adjective
pairs. A clustering algorithm then separates the adjectives into two sets of
different orientation, trying to place as many words of the same orientation as
possible in the same subset. The group with the highest average frequency is labelled
as positive.
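
The core intuition of the conjunction method can be sketched in Python as follows. This is a simplified illustration, not the authors' procedure: the log-linear regression model and the clustering step are replaced by a plain propagation of polarity from seed words, and the adjective set, the seeds and the function names are assumptions made for the example.

```python
import re
from collections import deque


def conjunction_links(sentences, adjectives):
    """Collect (adj1, adj2, same_polarity) from 'adj and adj' / 'adj but adj'."""
    links = []
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i in range(1, len(tokens) - 1):
            left, conj, right = tokens[i - 1], tokens[i], tokens[i + 1]
            if conj in ("and", "but") and left in adjectives and right in adjectives:
                links.append((left, right, conj == "and"))
    return links


def propagate(seeds, links):
    """Spread polarity (+1 / -1) from seed adjectives along the conjunction links."""
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for a, b, same in links:
            for src, dst in ((a, b), (b, a)):
                if src == word and dst not in polarity:
                    polarity[dst] = polarity[src] if same else -polarity[src]
                    queue.append(dst)
    return polarity


sentences = [
    "The room was clean and comfortable",
    "The staff was friendly but slow",
    "Service was slow and rude",
]
adjectives = {"clean", "comfortable", "friendly", "slow", "rude"}
links = conjunction_links(sentences, adjectives)
print(propagate({"clean": 1, "friendly": 1}, links))
```

In this toy run "comfortable" inherits the positive label of "clean", while "slow" and "rude" end up negative through the “but” link.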
In a later paper (4), Liu enhanced the conjunction theory by noting specific sub-cases
and introducing relevant parameters into his method:

• Intra-Sentence Conjunction Rule. The opinion on both sides of “and” is not
always the same (e.g. “The camera takes great pictures and has a short term of
life”).

• Pseudo Intra-Sentence Conjunction Rule. Sometimes an explicit conjunction “and”
is not used. The same opinion holds within the same sentence unless there is a
“but”-like clause, e.g. “The camera has a long battery life, which is great”.

• Inter-Sentence Conjunction Rule. People usually express the same opinion across
sentences unless there is an indication of opinion change, using words such as
“but” and “however”.

Pointwise Mutual Information (Turney, 2001): This method is based on the intuition
that terms with the same orientation tend to co-occur in the same document. It uses
the words “excellent” and “bad” as anchors of positive and negative polarity
respectively. To decide the polarity of an opinion word (OW), the AltaVista search
engine was queried for co-occurrences of the OW with the word “bad” and then with
the word “excellent” in Internet documents. Depending on its “distance” from the two
anchor words, the OW was classified as positive or negative. Turney and Litman
showed in their experiments that the conjunction method makes more efficient use of
corpora than the PMI method, but the advantage of PMI is that it can easily be scaled
up to very large corpora, where it can achieve significantly higher accuracy, as
noted in (23).
The same intuition can be used for feature extraction. In (33), researchers use PMI
scores of candidate feature words for the restaurant domain with the word “restaurant”.
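
As a rough illustration of the PMI idea, the sketch below computes a semantic orientation score as the difference between the PMI of a word with the positive anchor and its PMI with the negative anchor. Search-engine hit counts are replaced by co-occurrence counts in a tiny in-memory document collection; the `hits` helper, the smoothing constant `eps` and the sample documents are assumptions made for the example.

```python
import math


def hits(docs, *words):
    """Number of documents containing all the given words (stand-in for hit counts)."""
    return sum(1 for doc in docs if all(word in doc for word in words))


def so_pmi(word, docs, pos="excellent", neg="bad", eps=0.01):
    """PMI(word, pos) - PMI(word, neg), written as a single log ratio.

    eps is a small smoothing constant that avoids division by zero.
    """
    numerator = (hits(docs, word, pos) + eps) * (hits(docs, neg) + eps)
    denominator = (hits(docs, word, neg) + eps) * (hits(docs, pos) + eps)
    return math.log2(numerator / denominator)


# Each document is represented as a set of tokens.
docs = [
    {"excellent", "clean", "room"},
    {"excellent", "clean"},
    {"bad", "dirty"},
    {"bad", "dirty", "room"},
]
for candidate in ("clean", "dirty", "room"):
    print(candidate, round(so_pmi(candidate, docs), 2))
```

A positive score suggests a positive orientation and a negative score a negative one; a word such as "room" that co-occurs equally with both anchors scores zero.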

WordNet (Hu): WordNet is a database of terms developed at Princeton University in
which synonymous words are grouped into sets called synsets. Relations between words
can be synonymy or antonymy. For Opinion Mining, in order to find the orientation of
an adjective, the system searches for a word of known polarity in the synonym or
antonym set of the adjective. As the WordNet sets are substantial in size, a lexicon
can easily be scaled up simply by searching the synonyms and antonyms of each known
word (seed word). Further, an Opinion Word has three characteristics: potency,
activity and evaluation (Osgood et al., 1957). These characteristics are carried over
to the WordNet lexicon resource. In various methods, the SO (semantic orientation)
value is defined by measuring the minimum path between the phrase being evaluated and
“good”, and then “bad”. In (21), however, it was found that including those notions
did not help the accuracy of the results. WordNet is good for finding many words from
a small number of seeds, but it is not effective for domain-dependent words.
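
The seed-expansion idea can be sketched as a breadth-first search over synonym and antonym relations. To keep the example self-contained, WordNet itself is replaced by two small hand-made dictionaries; `SYNONYMS`, `ANTONYMS` and `expand_lexicon` are hypothetical names, and a real implementation would query the WordNet database instead.

```python
from collections import deque

# Toy stand-ins for WordNet relations (assumed data, not taken from WordNet).
SYNONYMS = {"good": ["fine", "nice"], "nice": ["pleasant"], "bad": ["poor"]}
ANTONYMS = {"good": ["bad"], "pleasant": ["unpleasant"]}


def expand_lexicon(seeds, synonyms=SYNONYMS, antonyms=ANTONYMS):
    """Grow a polarity lexicon from seed words: synonyms keep the polarity,
    antonyms receive the opposite polarity."""
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for synonym in synonyms.get(word, []):
            if synonym not in polarity:
                polarity[synonym] = polarity[word]
                queue.append(synonym)
        for antonym in antonyms.get(word, []):
            if antonym not in polarity:
                polarity[antonym] = -polarity[word]
                queue.append(antonym)
    return polarity


print(expand_lexicon({"good": 1}))
```

Starting from the single seed "good", the search labels seven words, which illustrates how quickly such a lexicon grows and also why domain-dependent words, absent from these general relations, are not covered.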

Gloss Classification Method (Esuli and Sebastiani): The researchers developed
SentiWordNet, in which terms are classified according to their glosses. The intuition
used is that words with similar orientation have similar glosses.

General Inquirer: GI is an index of terms that contains a short definition of each
term as well as tags about its polarity (negative, positive) and other properties
(overstatement, understatement, negation term). GI includes not only Opinion Words
but also intensifiers, diminishers and negation terms. Like WordNet or SentiWordNet,
this lexicon can be used as a database for analyzing an opinion document.

Other approaches or variations of the methods above have also been proposed. In (8),
regarding opinion words, researchers tried to tackle the domain-dependency issue.
They did so by first finding the positive and negative words with the highest
frequency. They then used WordNet to enrich the lexicon with synonyms, and finally
added words with high frequency that were not in the generated list as
domain-specific words.

In (9), researchers used polarity anchors and the Normalized Google Distance,
slightly modifying the traditional PMI method.
Manual lexicon development is also possible. In (7), researchers considered Tourist
Attraction Names to be important Feature Words in the travel domain. As no
recognition tool existed for that case, they manually prepared a list of tourist
attractions. They also noticed that the traditional ways of locating domain-specific
words are not suited to the travel domain, because a tourist object can be anything.
Therefore, they evaluate the likelihood of a word appearing in an opinionated
sentence by computing the ratio between the number of opinionated sentences in which
the word occurs and the total number of sentences in which the word occurs.
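
The ratio described above is simple to compute once sentences carry an opinionated / non-opinionated annotation. The following Python sketch assumes such annotations are available as (tokens, flag) pairs; the function name and the sample data are illustrative only.

```python
def opinionated_ratio(word, sentences):
    """Share of the sentences containing `word` that are annotated as opinionated.

    `sentences` is a list of (tokens, is_opinionated) pairs.
    """
    flags = [is_opinionated for tokens, is_opinionated in sentences if word in tokens]
    if not flags:
        return 0.0  # the word never occurs
    return sum(flags) / len(flags)


annotated = [
    (["the", "beach", "was", "beautiful"], True),
    (["the", "beach", "is", "north", "of", "town"], False),
    (["beautiful", "view"], True),
]
print(opinionated_ratio("beach", annotated))
print(opinionated_ratio("beautiful", annotated))
```

Words with a high ratio are likely opinion-bearing in the domain, whereas a word such as "beach" appears just as often in factual sentences.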
In (14), researchers create an exclude list containing terms that are commonly found
in the texts and are considered irrelevant, e.g. the word “movie” in movie reviews.
This produced better results with common machine learning techniques (SVD, CART and
Bayes Net).

In (8), researchers used WordNet, movie casts and labelled training data to generate
a keyword list. They also found that if they again create an exclude list with the
feature words whose frequency is lower than 1%, they still cover more than 90% of
occurrences. The number of remaining words for most feature classes was less than 20.

3.3.5 Subjectivity Classification
Some methods handle Subjectivity Detection as a separate step from Sentiment
Classification. Subjectivity Classification is defined as:

Let $S = \{s_1, \ldots, s_n\}$ be a set of sentences in document D. The problem of
subjectivity classification or subjectivity tagging is to distinguish sentences used
to present opinions and other forms of subjectivity (subjective sentences set $S_s$)
from sentences used to objectively present factual information (objective
sentences set $S_o$), where $S_s \cup S_o = S$. (35)

In other words, the task is to distinguish sentences, paragraphs or documents that
present opinions and evaluations from sentences that objectively present factual
information (8).

Opinion phrases are adjective, noun, verb or adverb phrases representing customer
opinions. Opinions can be positive or negative and vary in strength (18).

This step can be applied in areas in which subjective messages need to be identified
(e.g. flame recognition in e-mail communication), or in the context of sentiment
analysis proper, as a sub-step before polarity identification. It has been shown that
applying such a task can increase the success rate of sentiment analysis because, if
objective phrases are taken into consideration, polarity classifiers can be misled.

Various methods have been proposed for finding the subjectivity of a text. A
sentence-level Naïve Bayes classifier was used by Wiebe (1999), based on the presence
of specific syntactic schemas. Initial work found a strong correlation between the
presence of adjectives in a phrase and its subjectivity (Bruce & Wiebe 2000). At
document level, the subjectivity of a text was defined according to the number of
times that specific lexical features (e.g. the word “good”) were found in it.
However, that alone does not ensure subjectivity, as a different context can change
the emotional load of a word. In (17), a potential subjective element is
distinguished from a subjective element, which is an instance of it that is indeed
subjective in the specific context. The researchers managed to increase success rates
compared to the baseline adjective feature by using a similarity-identification
approach. Initially, they manually created a corpus of subjective words and
identified subjective sentences by finding words of that list in them. They then
refined the results further by creating pairs of words, using the intuition that
words that are paired through the same relationships with the same words tend to be
similar.
In (34), researchers used a pre-annotated collection of articles and a Naïve Bayes
classifier to find subjectivity at document level. Success rates were extremely high
(97%). To complete the same task at sentence level, they used three different
methods: a Similarity approach, in which the subjectivity of a phrase is judged
according to its level of similarity with other phrases already tagged as subjective;
a single Naïve Bayes classifier; and Multiple Naïve Bayes, in which each classifier
is based on a different subset of features (success rates were up to 91%).

In (28), researchers present a subjectivity detector. Instead of evaluating each
sentence in isolation, they take coherence into consideration, meaning that a
sentence tends to have the same orientation as its nearby sentences. To evaluate
pairs of sentences, they use a graph-based minimum-cut formulation, which adds an
extra factor that affects the basic subjectivity detector.

In general, finding subjective phrases before proceeding with other analysis can
improve the results by keeping the polarity classifier from considering irrelevant
text.
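
A sentence-level Naïve Bayes subjectivity classifier of the kind mentioned above can be sketched in a few lines of Python. The sketch uses plain word tokens as features with add-one smoothing; it does not reproduce the feature sets of the cited works, and the class name and the four training sentences are assumptions chosen only to make the example runnable.

```python
import math
from collections import Counter


class NaiveBayesSubjectivity:
    """Minimal multinomial Naive Bayes over word tokens (add-one smoothing)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = {token for tokens in docs for token in tokens}
        return self

    def predict(self, tokens):
        def log_score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            # Words never seen in training are ignored.
            return self.prior[c] + sum(
                math.log((self.counts[c][t] + 1) / total)
                for t in tokens if t in self.vocab
            )
        return max(self.classes, key=log_score)


train_docs = [
    "i love this great hotel".split(),
    "the food was awful".split(),
    "the hotel has 40 rooms".split(),
    "breakfast is served at 8".split(),
]
train_labels = ["subjective", "subjective", "objective", "objective"]
model = NaiveBayesSubjectivity().fit(train_docs, train_labels)
print(model.predict("great food".split()))
print(model.predict("hotel has rooms".split()))
```

Sentences predicted as objective could then be dropped before the polarity classifier is applied.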

3.3.6 Identify Opinion Holder
The opinion holder is the source of the opinion expressed in a specific passage of the
text. In product reviews this is not essential (as the writer is also the Opinion
Holder), but it becomes important in news reviews, which may contain the opinions of
different people on the same topic or may report the statements of another person,
e.g. a political figure.

The main methodology for this task uses graphical models and extraction pattern
learning to identify syntactic and semantic patterns (41).

Another approach also used syntactic path information between the Opinion Word and
candidate Opinion Holder words, together with the distance between them, to identify
a possible relation. The difference from the method above is that a Maximum Entropy
model was used to rank the candidates.

(27) combines a list of specific heuristics with syntactic rules and the position of
the candidate word (in the sentence and relative to the opinion word) in order to
identify the opinion holder in Thai texts. The heuristic rules include:

• It must be a recognized entity or pronoun

• It must associate and collocate with identified opinion operators in a certain
pattern

• It always occurs at the beginning of the sentence, or near the beginning or end
of a quotation

• It frequently co-occurs with the topic words and entities in the query

3.3.7 Sentiment Classification
The Sentiment Classification stage is the main processing stage. Here, the lexicon
developed in the previous stage is used to classify the document. If the analysis is
done at document level, the presence of Opinion Words is used as the indicator. If
the analysis is done at feature level, the Opinion Words must first be linked to
Feature Words. In both types of analysis, attributes of Opinion Words other than
polarity, such as subjectivity and strength, can also be taken into consideration.

Methods used for sentiment classification can be categorized, according to the degree
to which the classifier is trained beforehand, as Supervised, Semi-supervised and
Unsupervised. Supervised methods make extensive use of a training corpus in order to
“learn” the domain behaviour, meaning the Opinion Words, the Feature Words and the
relations between them. They are more accurate, but also domain-dependent, and
require a lot of human work to prepare the training set. Unsupervised methods rely on
syntactic and semantic rules to classify the document.

The most common classifiers used in supervised methods include Naïve Bayes, Maximum
Entropy and Support Vector Machines (SVM), with the latter proven the most effective
(35), reaching a very good level of accuracy (92%). However, the main problem with
classifiers is that they are single-domain, as the same word may have a different
meaning in different domains, and that they require many man-hours to tag and prepare
the training set. Read found that classifiers can also be temporally dependent (42).
In recent years, the availability of ready-made data sets has increased the use of
supervised methods. The domain-dependency issue can be addressed with intuitions such
as finding words that have the same polarity in both domains and treating those as
appropriate features.

A simple way to classify a text at document level using unsupervised methods is to
spot specific words of known polarity in it. Pang et al. (20) found that simply
evaluating the presence of a word in a text is more valuable than taking its
frequency into consideration. Phrases in which positive words are found are
considered positive, and vice versa. This kind of analysis is called “bag-of-words”,
as it only considers the existence of words without further investigating the
grammatical and syntactic patterns that may exist.
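
A presence-based bag-of-words classifier of this kind can be sketched as follows. The two word lists are a hypothetical miniature lexicon, and the "neutral" outcome for ties is an assumption of the sketch rather than part of the method described above.

```python
import re

# Hypothetical miniature lexicon of words with known polarity.
POSITIVE = {"good", "excellent", "clean", "friendly"}
NEGATIVE = {"bad", "dirty", "noisy", "rude"}


def classify(text, positive=POSITIVE, negative=NEGATIVE):
    """Classify by the presence of known words; repetitions are not counted."""
    present = set(re.findall(r"[a-z]+", text.lower()))
    score = len(present & positive) - len(present & negative)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("Good, good, good location but dirty and noisy room"))
```

Because only presence is considered, the repeated "good" counts once, so the two distinct negative words decide the outcome; a frequency-based count would have labelled the same text positive.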
In (29), researchers use a third class for Opinion Words: neutral. By assigning weak
opinion words to that class, they found that they prevent the classifier from making
mistakes.

Turney (44) used the PMI method to classify the sentiment of documents. Phrases
containing adverbs and adjectives were extracted and their semantic orientation was
evaluated by querying the AltaVista search engine. The words “excellent” and “poor”
were used as seed words. The document is classified according to the average
semantic orientation of its component phrases.

Kamps and Marx (2002) used WordNet relationships to produce values for three
parameters of each candidate OW in the text: evaluation (good or bad), activity
(active or passive) and potency (strong or weak). The Minimum Path Length (MPL) is
measured between the seed words and the words in question.

Other methods include techniques such as Naïve Bayes, ME (Maximum Entropy) and SVM.
These can be combined with Natural Language Processing techniques, which require a
pre-tagged lexicon and a set of grammatical or syntactic rules.
In (27), it is noted that the polarity of a paragraph strongly affects the polarity
of a sentence within it. Thus, after estimating the polarity of every sentence, the
method estimates the polarity of each paragraph in order to produce a better
classifier. In general, the context of a sentence can affect the determination of its
polarity through a related variable.

In (7), a First Sentence Feature is used, on the assumption that the first sentence
usually states the overall opinion of the author. This variable can affect the
polarity decided for phrases considered neutral.

An interesting observation was made in (26), where researchers found that results for
PROS reviews are more accurate than for CONS reviews, because reviewers use more
explicit terms in the former.
In (19), researchers introduced relaxation labelling for finding the semantic
orientation of words in context. Relaxation labelling is a well-known iterative
procedure. The inputs that the researchers used were:

• a set of objects

• a set of labels

• initial probabilities for each object's possible labels

• the definition of an object’s neighbourhood

• the definition of neighbourhood features

• the definition of a support function for an object label

The algorithm iterates over the objects, re-evaluating the label of each one with the
new neighbourhood scores. The process stops when the scores have remained the same
for several cycles. Their research led to the development of a relevant Opinion
Mining tool called OPINE.
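
The iterative procedure can be illustrated with a generic relaxation labelling sketch in Python. It is not the OPINE implementation: the support function used here (the average agreement of the neighbours with a label) and the stopping rule (largest probability change below a tolerance) are simplifying assumptions, as are all names and the sample data.

```python
def relaxation_labelling(initial, neighbours, max_iter=100, tol=1e-6):
    """Generic relaxation labelling over two labels, "pos" and "neg".

    initial:    {object: {"pos": p, "neg": 1 - p}} initial label probabilities
    neighbours: {object: [(other_object, same_orientation), ...]}
    """
    labels = ("pos", "neg")
    flip = {"pos": "neg", "neg": "pos"}
    probs = {obj: dict(p) for obj, p in initial.items()}
    for _ in range(max_iter):
        updated = {}
        for obj, p in probs.items():
            links = neighbours.get(obj, [])
            new = {}
            for label in labels:
                # Support: how strongly the neighbourhood agrees with this label.
                support = sum(
                    probs[other][label if same else flip[label]]
                    for other, same in links
                )
                support = support / len(links) if links else 0.0
                new[label] = p[label] * (1.0 + support)
            total = sum(new.values())
            updated[obj] = {label: value / total for label, value in new.items()}
        change = max(abs(updated[o][l] - probs[o][l]) for o in probs for l in labels)
        probs = updated
        if change < tol:  # scores have stabilised
            break
    return probs


initial = {
    "great": {"pos": 0.9, "neg": 0.1},
    "small": {"pos": 0.5, "neg": 0.5},
    "noisy": {"pos": 0.1, "neg": 0.9},
}
# "small" is linked with "great" by "and" and with "noisy" by "but" (assumed links).
neighbours = {"small": [("great", True), ("noisy", False)]}
result = relaxation_labelling(initial, neighbours)
print(round(result["small"]["pos"], 3))
```

In this toy setting the ambiguous word "small" starts at 0.5 and is pulled towards the positive label by its context, while words without neighbours keep their initial probabilities.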

Unsupervised methods rely on the quality of the lexicon created. The methods for
creating such a lexicon were described on chapter 5.1.

If the analysis is done at feature level, feature words must first be spotted and
relations between Opinion Words and specific Feature Words established. A feature can
be a “part-of” the topic (such as the screen of a camera), or a property of the topic
or of one of its parts (such as the size of a camera) (Hu and Liu, 2004).

In (21), researchers used Turney’s PMI score, Osgood’s semantic values and syntactic
features (such as the OW-FW proximity) as parameters for an SVM. This obtained better
scores than an SVM with a “bag of words” approach.

Relations between Opinion Words and Feature Words can be established for words
belonging to the same phrase. However, some researchers search for pairs within a
distance of n phrases in order to improve their results. For example, one approach to
the coreference issue analyzed later is to search for an FW in the preceding phrases
if none exists in the phrase containing the OW. A pair can therefore be established
if it exists within a sentence or within a window of sentences (7).

In general, pairs are obtained with the method of adjacent adjectives, based on the
theory that a product feature and its corresponding opinion word usually co-occur
within a specific distance. In this way, Opinion Units (7) are established.
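
The pairing strategy described above (same phrase first, then a window of preceding phrases) can be sketched in Python. The function name, the default window of one sentence and the sample word lists are assumptions made for illustration, not the settings of any cited work.

```python
def pair_opinions(sentences, feature_words, opinion_words, window=1):
    """Pair each opinion word (OW) with a feature word (FW).

    The nearest FW in the same sentence is preferred; if there is none, up to
    `window` preceding sentences are searched for the most recent FW.
    Returns a list of (opinion_word, feature_word, sentence_index).
    """
    pairs = []
    for i, tokens in enumerate(sentences):
        for position, token in enumerate(tokens):
            if token not in opinion_words:
                continue
            candidates = [
                (abs(position - j), other)
                for j, other in enumerate(tokens) if other in feature_words
            ]
            if candidates:
                pairs.append((token, min(candidates)[1], i))
                continue
            for back in range(1, window + 1):
                if i - back < 0:
                    break
                previous = [t for t in sentences[i - back] if t in feature_words]
                if previous:
                    pairs.append((token, previous[-1], i))
                    break
    return pairs


sentences = [
    ["the", "screen", "is", "bright"],
    ["it", "is", "also", "sharp"],
    ["battery", "life", "is", "short"],
]
pairs = pair_opinions(sentences, {"screen", "battery"}, {"bright", "sharp", "short"})
print(pairs)
```

The second sentence contains no feature word, so "sharp" is attached to "screen" from the previous sentence, which is the simple fallback for implicit references mentioned above.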
In (8), the researcher based his analysis on grammatical rules in order to track
OW-FW pairs. He recognized explicit pairs and implicit pairs. Explicit pairs were
those that co-occurred in the same sentence S. However, because some sentences
contain many OWs and FWs, he used further grammatical analysis to establish the pairs
successfully. With the help of the Stanford Parser, he used the parameters below:

• shortest path between OW and FW

• POS Analysis

• Relation Sequence

• Negation Words

By keeping the most frequent such sequences, he established a dependency template
syntax that he later used to establish pairs. For implicit pairs, he manually built
an index of pairs in order to identify them.

In (6), researchers created a “background words” list. For texts discussing the
“i-phone”, for example, this could contain the words “i-phone” and “Apple”. By
isolating those words from the analysis, they allow their classifier to concentrate
on the words that actually affect the polarity of the text.

In (31), four different approaches were tested in order to spot pairs:

• full sentences,

• words between the opinion holder and the topic,

• the region of the second approach +/- two words,

• from the first word behind the holder to the end of the sentence.

The fourth outperformed the others.

Methods can be divided into single-layer and dual-layer. As an example, in (7) a
dual-layer method was proposed in which the first layer classifies a string as an
Opinion Phrase or not. If this test succeeds, the OW-FW pair is checked for validity
in a second layer.

Such a similarity-based approach was also introduced in (27), in which test
expressions are compared syntactically with annotated template expressions. When
similarities are found, parts of the test expression are labelled according to the
pre-annotations of the template.

(34) uses a similar intuition to find the polarity of a sentence. The intuition is
that, within a given topic, objective sentences will be more similar in structure to
other objective sentences than to subjective ones.
In (16), researchers created trees with paths between a Topic word and a Sentiment
word, for hotel reviews. They assumed that there is a conceptual relation between a
topic and a sentiment if they co-occur within a certain distance threshold. After a
relation is established, they check for a negation word on the path.

Negation words (e.g. no, not, but) should take part in the analysis, as they modify
the polarity of an OW-FW pair. They are considered valence shifters. One can search
for modifiers on the path between the Opinion Word and the Feature Word or,
alternatively, in a window of n phrases around the Opinion Word (as done in (9)).
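
The window-based treatment of negation can be sketched as follows. For simplicity the window here is measured in tokens preceding the Opinion Word rather than in phrases around it, and the negation list, the window size and the function name are assumptions made for the example.

```python
# Hypothetical list of negation words acting as valence shifters.
NEGATIONS = {"no", "not", "never"}


def contextual_polarity(tokens, lexicon, window=3, negations=NEGATIONS):
    """Flip the prior polarity of an opinion word for each negation found in
    the `window` tokens that precede it."""
    results = []
    for i, token in enumerate(tokens):
        if token in lexicon:
            preceding = tokens[max(0, i - window):i]
            flips = sum(1 for t in preceding if t in negations)
            polarity = -lexicon[token] if flips % 2 else lexicon[token]
            results.append((token, polarity))
    return results


lexicon = {"clean": 1, "dirty": -1}
print(contextual_polarity("the room was not clean".split(), lexicon))
print(contextual_polarity("the room was clean".split(), lexicon))
```

Searching the dependency path between the Opinion Word and the Feature Word, as in the first alternative above, would need a parser and is not shown.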
(18) introduces the notion of Class of Feature. By identifying three types of
“influential keywords” for each theme (Principal, Complementary and Auxiliary), they
apply a different level of influence to each word.

Gradability also affects the strength of the opinion. Gradability is “..the semantic
property that enables a word to participate in comparative constructs and to accept
modifying expressions that act as intensifiers or diminishers.” (17). Its presence
alone is a good indicator of subjectivity. Hatzivassiloglou and Wiebe (2000) used two
indicators for spotting gradability: the presence of modifiers from a manually
compiled list (little, very, somewhat etc.) and the presence of inflected forms of
adjectives. A log-linear statistical model used those two indicators to produce a
final decision about the document’s gradability.

Appraisal Theory. Whitelaw et al. use Appraisal Groups instead of Opinion Words as
units of evaluation. Appraisal groups are “..coherent groups of words that express
together a particular attitude, such as “extremely boring” or “not really very good””
(25). Each appraisal group has specific semantic values and, at least, the Appraiser,
the Appraised, the Appraised Type and the Orientation. The paper extracts adjectival
Appraisal Groups, which are a good sign of subjectivity. Appraisal groups follow
specific taxonomies according to four attributes: Attitude, Orientation, Graduation
and Polarity. By following this theory, researchers are able to handle phrases such
as “truly a really horribly”. Appraisal denotes how language is used to adopt or
express an attitude of some kind towards some target.

They then compute the frequency of word groups with specific elements. The results
showed that “Bag-of-Words” outperformed each of the individual feature sets created
from Appraisal Theory. However, combining “Bag-of-Words” with appraisal theory
produced an accuracy of 90.2%.

In conclusion, the parameters taken into consideration in Opinion Mining analysis
are:

• Polarity of a known word

• Strength of Opinion of a Known Word

• Known Pairs of Opinion Words – Feature Words

• Part-of-Speech properties

• Syntactic Patterns

• Known linguistic and semantic properties of words from lexicons like WordNet,
SentiWordNet, General Inquirer.

• Gradability (e.g. very, more)

• Modifiers (Negation)

• Known polarity of the context

• Known Polarity of the document (e.g. First Sentence Rule)

• Syntactic Similarity of the string with strings of known polarity

• Co-presence of this word with known seed words (e.g. PMI)

• Other heuristics (frequency of occurrence, proximity with opinion word…)

Sentiment Analysis: Basic Methods Index

Bag-of-Words Approach, Pang et al.

Conjunction Theory, Hatzivassiloglou - McKeown

Synsets Usage (WordNet), Kamps - Marx

Glosses Analysis (SentiWordnet), Esuli - Sebastiani

Relaxation Labeling, Popescu - Etzioni

PMI Method, Turney

Syntactic and Semantic Heuristics (Various)

Statistical Models (Naïve Bayes, SVM)

Appraisal Theory, Whitelaw

Figure 3.3: Index of basic methods for Sentiment Analysis

3.3.8 Visualization / Summarization
Visualization is extremely important for every tool whose objective is to present
information. The information should be aggregated and presented to the end user in a
way that lets them draw conclusions. Too much or poorly designed data representation
has the same result as too little data: a user who is not well informed.

Outputs of summarization process could be:

• A textual summary consisting of the most important clauses (20), (40). The task
of this technique is to create a short text that is a comprehensive summary of a
document. Although this technique is used primarily in Single-Document
Summarization (4), it can also be used in Multi-Document Summarization (Opinion
Mining is a Multi-Document Analysis). In the latter case, the output of the system
can be the sentiment of a document and its title (Ku et al.). The title is thought
to be a good summary of a text.

• Simple Opinion Sentences that describe the polarity for features. This is
useful when a Features’ Level Analysis is done.
• Scaling (binary or multi-scale). In (30), researchers tackle the issue of
presenting results not on a two-point scale (bad/good) but on a multi-step scale
(e.g. a five-star rating system). In experiments with human subjects, they
concluded that humans are good at scales of up to four grades. On larger scales,
only 5% of the documents were placed at the most negative or most positive grade.
They therefore defined their problem as four-class. They used three different
methods: One-vs-all, Regression and Metric Labeling.

• Numeric or statistical figures, such as the percentage of positive reviews or
the number of them. The first parameter is characterized as bounded (4), because
the sum over all its cases (e.g. positive and negative) is always the same (100
for percentages). The second is unbounded, because the sum over all the sub-cases
has no limit. Even though the second kind of summarization is helpful, in the
sense that 50% negative opinions could mean 2 texts or 200, readers comprehend the
bounded one better. For example, “..when eBay switched to display the percentage
of pieces of feedback on sellers that were negative, then negative reviews began
to have a measurable economic impact..” (4). Hence, it is useful for a
summarization system to include both kinds of data.

• Visualization via graphical representation. Regular histograms, bar charts or
pie charts can be used to show the results of the analysis. These graphs can again
be bounded or unbounded. Visual effects can also be applied to verbal summaries,
e.g. colouring an opinion word red if it is negative and green if it is positive.
A timeline can also be an extra factor represented on a graph. This gives a hint
of the trend of opinions regarding the object and its features.
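
The difference between bounded and unbounded figures can be made concrete with a small Python sketch that reports both for every feature. The input format, a list of (feature, polarity) pairs, and the function name are assumptions made for the example.

```python
from collections import Counter


def summarize(labelled_opinions):
    """Aggregate (feature, polarity) pairs into per-feature summary figures.

    Raw counts are unbounded; the percentage of positive opinions is bounded
    between 0 and 100.
    """
    counts = {}
    for feature, polarity in labelled_opinions:
        counts.setdefault(feature, Counter())[polarity] += 1
    summary = {}
    for feature, c in counts.items():
        total = c["positive"] + c["negative"]
        summary[feature] = {
            "positive": c["positive"],   # unbounded figure
            "negative": c["negative"],   # unbounded figure
            "positive_pct": 100.0 * c["positive"] / total if total else 0.0,  # bounded
        }
    return summary


opinions = [
    ("room", "positive"), ("room", "positive"), ("room", "negative"),
    ("staff", "negative"),
]
for feature, figures in summarize(opinions).items():
    print(feature, figures)
```

Showing the counts next to the percentage lets the reader see both the share of positive opinions and how many texts that share is based on.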

3.4 CHALLENGES - LIMITATIONS
The research done so far, as described above, has managed to develop methods that
achieve significant rates of success (up to 90%). However, the characteristics and
polymorphism of human expression cannot easily be predicted and analyzed by a
machine. For example, sarcastic phrases are very hard for an algorithm to analyze
correctly. Researchers have tried to minimize the errors produced by this linguistic
phenomenon by taking the polarity of context phrases into consideration. Of course,
this can be only partially successful.

Other challenges for Opinion Mining Analysis are:

• Coreference problem or Anaphora Resolution. It is common in written texts for
substitute words to be used when the same notion appears in different phrases. For
example: “Canon Pixma 300 photo quality is the best that a photo camera of its
category offers. They are clear and colourful.” The second phrase does not include
the Feature Word “quality” to which the Opinion Words refer. The word “they” is
used instead. As the Feature Word and the Opinion Word are not in the same phrase,
the pair is not directly identifiable; it is implicit. It has been shown, based on
manually annotated data, that opinion mining results can be improved by 10% if
coreference resolution is used (Nikolov et al., 2008).

Various intuitions are used to tackle this issue. In (38), candidate feature
words are identified based on the observation that, when the focus shifts from
one feature to another, the new feature is often expressed using a definite
noun phrase at the beginning of the next sentence.
In (11), two different notions are introduced. The first is sentiment
consistency, which means that humans tend to express the same sentiment in two
consecutive phrases (unless a different object is clearly stated). The second
is that only OW-FW pairs that are known to be linked are used. For example, the
opinion word “cheap” cannot be linked with the feature “picture quality”, even
if coreference resolution algorithms suggest such a link. The researchers used
Supervised Machine Learning Techniques, in which a pairwise function is used
to predict whether two noun phrases are coreferent. “Subsequently, when making
coreference resolution decisions on unseen documents, the learnt pairwise noun
phrase coreference classifier is run, followed by a clustering step to produce
the final clusters (coreference chains) of coreferent noun phrases.”

In (16), the problem was addressed with the help of the MARS System.

• Ambiguous words, especially domain-dependent terms. This is a main
challenge, tackled with the techniques described above. It has also been stated
as Term Senses: different senses of the same term have different
orientations in different contexts. In (38), researchers try to tackle the
situation in which a subject term is not expressed explicitly by searching for
words that are usually used in the context of this term. They build a
disambiguation process.

• Implicit feature-opinion pairs, which can be a case of the coreference
phenomenon (discussed above) or of very short sentences (e.g.
“Great!”). In both cases the feature word is missing, making it difficult to
assign the Opinion Word and decide the final polarity.

• Dialect. In (12), researchers using Twitter messages faced the fact that those
messages are usually written in a different dialect that should be analyzed
separately. The development of a special lexicon could be required.

• Multi-Word expressions, i.e. the usage of a chunk of text or a phrase to
express an Opinion Word or a Feature Word instead of just one word.
Appraisal Theory tackles this issue (Martin and White, 2005).

• Usage of Synonyms. In (26), researchers use the first two words on WordNet
for each feature to produce a list of synonyms for each word in their lexicon.
Other corresponding lexicons can also be used to extend the initial one.

• Conditional Proposition (e.g. “affordable prices if you have a fat wallet”).

• Non-opinion phrases containing an opinion word (e.g. “a good deal of”).

• No negation, despite a negation word (e.g. “Not only… but also”).

• No contrast, despite a “but” (e.g. “Not only… but also”).

• Embedded factual information. In (31), factual information expressed with
words that could be misinterpreted as opinion-bearing was found to affect the
decision process. For example, “this is film’s most horrific sequence” was
labelled by their classifier as negative, although it is an objective
statement.

• Thwarted expectations, where the emotive effect is attained by emphasizing
the contrast between what the reviewer expected and the actual experience, or
the “good actor trapped in a bad movie” phenomenon, discussed in (20). Thus,
the emotion conveyed by the parts is not an accurate guide to the whole. As
Turney noticed: “the whole is not always the sum of the parts”.

• Long parsing time. Analyzing a great number of documents from a linguistic
perspective, often iteratively, requires computational power, which
translates into long parsing times. In (5), researchers noticed that
“…the long parsing times are the consequences of using a scripting language
for the development and testing of the parser. The results should reduce by
a factor of several tens or even hundreds if the parser was implemented
on a natively compilable language…”
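
Two of the challenges listed above, non-opinion phrases that contain an opinion
word and negation words that do not negate, can be illustrated with a minimal
Python sketch. The word lists and the function name are hypothetical and far
smaller than any real lexicon, and the negation rule (a negation word flips the
next opinion word) is a deliberately crude assumption.

```python
import re

# Hypothetical toy lexicons, for illustration only.
OPINION_WORDS = {"good": 1, "great": 1, "bad": -1, "dirty": -1}
NEGATION_WORDS = {"not", "no", "never"}
# Multi-word expressions that contain an opinion or negation word
# but carry no opinion or negation themselves.
NEUTRAL_EXPRESSIONS = ["a good deal of", "not only"]


def phrase_polarity(phrase):
    """Return a signed score for one phrase (0 means no opinion found)."""
    text = phrase.lower()
    # Mask the neutral expressions first so their words are never matched.
    for expression in NEUTRAL_EXPRESSIONS:
        text = text.replace(expression, " ")
    tokens = re.findall(r"[a-z']+", text)

    score = 0
    negate = False
    for token in tokens:
        if token in NEGATION_WORDS:
            negate = True
        elif token in OPINION_WORDS:
            polarity = OPINION_WORDS[token]
            score += -polarity if negate else polarity
            negate = False  # the negation is consumed by this opinion word
    return score


print(phrase_polarity("The room was not good"))
print(phrase_polarity("We spent a good deal of time at the pool"))
print(phrase_polarity("Not only great but also clean"))
```

Masking the neutral expressions before tokenizing keeps the rule set small, at
the cost of having to enumerate those expressions by hand.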

3.5 SPECIAL ISSUES
3.5.1 Use of Ontologies
Web 2.0 is here, Web 3.0 is coming. In the new era of Internet evolution, the
objective is for the data that exist on it to carry much more semantic
information about themselves, making it possible for machines to understand
more of their qualities than they do today. A simple example would be a search
engine which, when queried with a phrase, does not return just the texts that
contain that phrase, but all the links that seem relevant to it (the Wolfram
Alpha search engine is a meta-search engine trying to do that).

The Semantic Web is based on extensive tagging with metadata and on taxonomies,
which are domain-dependent schemas of relations between the objects of the same
domain. The relationships that can be made between objects are more numerous
than today and, hopefully, make it easier for the machine to understand what
only humans can understand now.

Opinion Mining also bears the difficulties of analyzing notions understandable
by humans, meaning the language and the dialect that people use in their
everyday communication. Up to now, we have seen that a crucial step is to build
a successful lexicon of Opinion Words and Features. Semantic structures and the
hierarchical organization of objects in a domain can help with this by building
further semantic relationships.
In (33), the development of a domain ontology is proposed. Using that kind of
structure allows the analysis of more types of relationships between objects
and their features than the “a-part-of” relation that is typically used. The
researchers notice that with the development of an ontology, the extraction of
implicit and explicit relations will be more successful. Moreover, they notice
that traditional Opinion Mining structures lack organization, which leads to
summarizing “a bag-of-features”. Methods that were based on the development of
taxonomies produced better results by reducing the number of extracted
features. An ontology is an evolution of a taxonomy, being able to represent
more types of relationships than “is-a-part” (like synonymy).

No relationships          Taxonomies                 Ontologies
between features

All features are          “is-a” relations are       Richer types of
visualized                established on tree-like   relations are
independently. No         structures. Grouping,      established. Implicit
relation or grouping      reduction of features      types of relationship
can be established.       shown and visualization    are mined out.
                          on different levels can
                          be done.
Figure 3.4: Evolution of Opinion Mining from “bag-of-features” to Semantic

Analysis using Ontologies, improved:

• Feature extraction, as it creates new links – not existing in the initial
lexicon – using ontological relations.

• Pairing between Features and Opinions, especially for implicit pairs, as it
creates a more complex tree of relations between the domain’s objects.

• Visualization and summarization, as it improves the grouping of features.
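
The effect of grouping on summarization can be sketched in a few lines of
Python. The sketch covers only the taxonomy case (“is-a” links), not a full
ontology, and the feature names, scores and function names are hypothetical
examples rather than data from the cited works.

```python
# Hypothetical "is-a" taxonomy: child feature -> parent feature.
IS_A = {
    "bed": "room",
    "bathroom": "room",
    "shower": "bathroom",
    "pool": "facilities",
    "gym": "facilities",
}


def top_level(feature):
    """Follow "is-a" links upward until a feature without a parent is reached."""
    seen = set()
    while feature in IS_A and feature not in seen:  # `seen` guards against cycles
        seen.add(feature)
        feature = IS_A[feature]
    return feature


def group_scores(feature_scores):
    """Aggregate per-feature polarity scores under their top-level feature."""
    grouped = {}
    for feature, score in feature_scores.items():
        root = top_level(feature)
        grouped[root] = grouped.get(root, 0) + score
    return grouped


flat = {"bed": 2, "shower": -1, "pool": 3, "breakfast": 1}
print(group_scores(flat))
```

A feature that is missing from the taxonomy (here “breakfast”) stays as its own
group, which corresponds to the “bag-of-features” situation for that feature.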

In (22), researchers use RDF for interoperability and integration with the rest
of the MUSING Business Intelligence Suite. MUSING models subjective information
such as reputation and reliability in a Business Intelligence Ontology. The
researchers take advantage of this Ontology in the context of developing a
suite which creates an accurate picture of a Business Entity over time by
mining relevant opinions. The results of their analysis are tagged according to
the MUSING Ontology so that its relations can be used in further analysis and
summarization.

In (37), researchers are developing a marketing tool that will allow users to
better understand their market. Their approach is based on Semantic Web
Technologies, using SIOC (Semantically-Interlinked Online Communities) – a
vocabulary to represent discussions and posts – and linking their data with
public, semantically tagged datasets like DBpedia. The novelty of that work is
the definition of an Opinion Mining Ontology. This ontology can work as a
universal intermediary between the domain under analysis each time and the
Opinion Mining itself.

Further work in that field has emerged (44). Sentic Computing is a new
paradigm, an evolution of traditional Opinion Mining Techniques that uses
common sense reasoning tools and domain-specific Ontologies instead of
statistical learning models. This approach identifies the disadvantage of
traditional OM tools, which cannot effectively handle the fact that the
opinions expressed are context and domain dependent. Sentic Computing
applications include AffectiveSpace, Hourglass of Emotions, Human Emotion
Ontology, OMCSentics and SenticNet.

3.5.2 Opinion Mining and Social Web
Social Web has introduced a new challenge for Opinion Mining. The production of
information from 2.0 applications is huge. Twitter has emerged as the “news
media of the new era”. Data are produced in real time at such a pace that
traditional search engines (even behemoths like Google) are facing issues in
indexing them. On the other hand, as more people use that stream to publish
their opinions, it becomes very important for Opinion Mining Tools to be able
to exploit them, especially considering that it is a free-of-charge dataset.

In (12), researchers correlated the results of surveys conducted on confidence
and Political Opinion over the 2008-2009 period with the frequency of
sentiment words in Twitter messages. They found a positive relation to exist in
specific datasets.

Their analysis took into consideration only messages that explicitly contained
a topic keyword. They established a simple ratio between negative and positive
messages. A negative message was any message containing a negative word (as
established by the OpinionFinder subjectivity lexicon) and a positive message
was any message containing a positive word. Even though the researchers admit
that the method they use is a baseline one, with a high error rate, they assume
that the large volume of messages will smooth out the noise. Moreover, they
refer to the remark of Hopkins and King (2010) that text analysis techniques
could be inaccurate when the objective is to assess population proportions.
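
The counting scheme described above can be sketched in Python as follows. The
two word sets are tiny stand-ins for the OpinionFinder subjectivity lexicon,
and the function name and the direction of the ratio (positive count divided by
negative count) are assumptions of this sketch.

```python
import re

# Tiny stand-in word lists; the study relied on the OpinionFinder lexicon.
POSITIVE = {"good", "great", "hope", "win"}
NEGATIVE = {"bad", "fear", "lose", "crisis"}


def sentiment_ratio(messages, topic_keyword):
    """Ratio of positive to negative messages among those mentioning the topic."""
    positive = negative = 0
    for message in messages:
        words = set(re.findall(r"[a-z']+", message.lower()))
        if topic_keyword.lower() not in words:
            continue  # only messages that explicitly contain the keyword
        # A message may count as both positive and negative.
        if words & POSITIVE:
            positive += 1
        if words & NEGATIVE:
            negative += 1
    if negative == 0:
        return None  # the ratio is undefined without negative messages
    return positive / negative


tweets = [
    "jobs report looks good",
    "fear of losing jobs",
    "good jobs but bad pay",
    "jobs are great",
    "great weather today",
]
print(sentiment_ratio(tweets, "jobs"))
```

No attempt is made to handle negation or sarcasm, which is consistent with the
baseline character of the method: the expectation is that errors average out
over a large number of messages.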

The results showed that analyzing tweets with this method, even though it is
not a safe one, can be used as an indicator of public opinion trends.

In general, the analysis of that stream has its own challenges. Indexing
real-time data is a challenge on which real discussion and progress is taking
place nowadays (e.g. the development of applications on Hadoop). However, no
further discussion will be made of this, as it is out of the scope of the
present work. Concerning linguistic analysis, an aspect that should be taken
into consideration is the short messages found on new Web Services (like
YouTube, Twitter etc.), as noted in (1). Lack of concern for grammar and
spelling, abbreviations (e.g. “lol”) and emoticons are characteristics of that
group of content. As this kind of data frequently disobeys general linguistic
rules, the development of a special lexicon and rules could be required.
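
As an indication of what such a special lexicon and rules might look like, the
following Python sketch normalizes a short message before any sentiment
analysis is applied. The abbreviation and emoticon tables, the placeholder
tokens and the rule for repeated letters are all hypothetical.

```python
import re

# Hypothetical special lexicon for short-message content.
ABBREVIATIONS = {"lol": "laughing", "gr8": "great", "thx": "thanks"}
EMOTICONS = {":)": "EMO_POS", ":-)": "EMO_POS", ":(": "EMO_NEG", ":-(": "EMO_NEG"}


def normalize(message):
    """Turn a short message into a list of normalized tokens."""
    tokens = []
    for raw in message.split():
        if raw in EMOTICONS:
            tokens.append(EMOTICONS[raw])  # keep emoticons as polarity markers
            continue
        word = re.sub(r"[^a-z0-9']", "", raw.lower())  # drop punctuation
        if not word:
            continue
        # Collapse a character repeated three or more times ("sooo" -> "so").
        word = re.sub(r"(.)\1{2,}", r"\1", word)
        tokens.append(ABBREVIATIONS.get(word, word))
    return tokens


print(normalize("LOL the pool was gr8 :)"))
print(normalize("sooo good!!!"))
```

The repeated-letter rule is crude and can damage words that legitimately
contain a doubled letter written with an extra repetition, so a real
implementation would need a dictionary check.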

3.6 CONCLUSIONS
What was made clear in the last two chapters is the evolution of the Web,
amongst others, as an invaluable resource of information. Additionally, its
character, very different from that of traditional sources (virtually real-time
and constantly updating, in very big quantities, direct feedback from
customers, low cost or even free), makes it unwise for any company not to
listen to it. Management should be able to “read” that stream of information
when evaluating the business environment and taking decisions.

As we are talking primarily about unstructured data, the task of integrating
them into a relational database has not proved easy. Broad research has been
done on analyzing all the phenomena and characteristics of human language that
are understandable by a human (sometimes not even by him) but certainly not by
a machine. Techniques from the Web-Mining era are broadly used. Machine
learning techniques and statistical analysis of grammatical and syntactic
rules, as well as the presence of semantically oriented words, were the primary
means to understand and interpret the emotional load of a text.

Even though all methods used have limitations, the research seems promising.
The main effort is toward two objectives: increasing the success rates of
analysis – i.e. correctly identifying the sentiment orientation of a text – and
minimizing the effort of the analysis – e.g. by developing a
domain-independent classifier that would not rely on a domain-specific learning
procedure prior to the analysis. In specific efforts, the experimental results
are impressive (over 90%), even though usually on a domain-specific and
controlled corpus. Moreover, as more lexical resources become available to
users and techniques from the AI and Semantic era (Ontologies) are now broadly
used, it is very possible that current limitations will soon be overcome,
eventually creating solutions whose error will be statistically insignificant.

CHAPTER 4: APPROACH OF
BUSINESS INTELLIGENCE AND
WEB MINING METHODS FOR
ENHANCED TRAVEL SERVICES

4.1 INTRODUCTION
In this chapter, a framework is described for a BPMS (Business Process
Management System) integrated with the value chain of the company in order to
directly monitor and affect the business processes that produce value for
the company. A semantic layer is applied in order to create an environment
in which the user can easily describe the domain of his business and create his
own reporting environment based on those rules. Moreover, a detailed
description is given of the Opinion Mining Module, as the exploitation of
unstructured sources is perceived as being the core of a fully “intelligent”
company in our times.

All the above is described in the context of a real Business Case of a Travel
Agency working on a SAP® Business One Platform. However, the framework is
developed to be globally applicable and independent of the underlying platform
used or the domain in which it operates. Moreover, it is designed to adapt
easily to those different environmental parameters.

In the next section, the Business Case is described, followed by the
description of the framework with emphasis on the Opinion Mining Module.

4.2 BUSINESS CASE DESCRIPTION
The Business Domain of our case study is that of Travel. The Travel Agency
works in a competitive environment with a lot of active players. Moreover, like
every Greek company, it faces a declining market due to the financial crisis. A
lot of Internet platforms have also been built and operate, creating a
globalized market with strong competitors from abroad.

In that environment, information and knowledge are vital for the company. The
prices that it offers should be competitive, so it should promote any special
price offers from its suppliers. Moreover, as its propositions are based on the
uniqueness of the offer and its experience in creating successful travels, it
should know as many details as possible about all the hotel services it
proposes.

A basic department of the organization is the Sales Department. The employees
of this department are the ones that come in contact with potential customers,
hear their requirements and should offer proposals that fit what the customer
wants as well as his financial limitations.

In the current situation, Salesmen do have access to a limited amount of
information produced by internal users for this purpose. However, this is still
small in comparison to the amount of information produced on the Web, which
would give them the opportunity to take into consideration much more real-time
data and customer comments before making their proposals.

The analysis below is mainly done from the viewpoint of:

1. The Sales Agent, as a critical part of the company that could clearly
benefit from enhanced information.

2. The Management, which should monitor Business Processes through KPIs
and have tools to query the database extensively in order to support strategic,
long-term decisions.

4.3 FRAMEWORK DESCRIPTION
4.3.1 General Presentation
The framework proposed for a Business Intelligence System covers the
requirements of a BPMS. The value chain is at the centre of its operation. An
initial step of the implementation is the identification of Business Processes
and of KPIs in order to monitor them.

On the other hand, the exploitation of unstructured Web Resources plays a very
important role in the system. This is done using two methods:

• Presentation as a non-processed stream of news

• Integration into the database through Web-ETL methods

The system is supported by specific Ontologies that should enhance the
capabilities of visualization and of querying the engine, as well as make it
more independent of the domain in which it operates each time.

In the sub-chapter below, the framework for a BPMS is described. Specific notes
from its implementation on a Travel Agency System running on a SAP® Business
One ERP platform are given in light blue.

Figure 4.1: BPMS based on Ontologies, with Opinion Mining Sub-Module

4.3.2 Module by Module Analysis
ETL - Data Warehouse: A data warehouse architecture is used in order to
integrate all sources in a single place and ensure a single source of truth.
The exact architecture of the database and the details of the ETL methods are
out of the scope of the present work and will not be discussed further.

As SAP® Business One covers the implementation of all the business processes
(including CRM), all data are stored in a single relational database which
takes the role of a Data Warehouse. No further integration is required or ETL
techniques applied, as ERP data sources are virtually the only ones. The
Web-ETL module and Opinion Mining techniques take, instead, the primary role of
integrating data from the “outer world” that shake that balance with their lack
of structure but also with their constantly growing importance.

Web-ETL: This module handles the data that are sourced from the Internet. Being
mainly unstructured, they should be integrated into the structured form of a
relational database. The sources for this module are:

• Pre-designated URLs, which are crawled regularly in order to draw data from
them. They are stored in the Data Sources Repository and are mainly
used by the Opinion Mining module. This operation will be discussed
extensively in the next sub-chapter.

• Web Services APIs. Many companies now have active pages on Facebook
through which they promote specific services or products. As this is a
major channel of communication with its customers, a company could draw
information using the Facebook API. Those sources are mainly used by the Web
Mining module. As this is out of the scope of the present work, no further
discussion will be made.

Those tools primarily address the operational level rather than the strategic
level of decision making.

In its implementation on SAP® Business One, the reporting of the results of
promotional events, so-called Projects, is enriched. The company promotes
special offers and events through designated Event Pages on Facebook, which
have a measurable financial result through the notion of the project. That data
can be enriched with data from the Facebook API, for example the number of
friends that the page has, in order to better evaluate its success.

BPMS: Business processes are recognized and tied to the operation of the
Business Intelligence System. For each of them, KPIs are stored in the KPI
Repository and are monitored through Dashboards and an Alerting System that
give feedback to the user. If the user observes deviations from the desired
performance, he can then calibrate the processes. The system tries to simulate
a Closed-Loop BI system, where the system’s performance is fed back into it as
a data source for further analysis.
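
The alerting part of this loop can be sketched in Python as follows. The `Kpi`
structure, its `target` and `tolerance` fields, the KPI named “Offers Sent” and
all figures are assumptions of this illustration (targets are assumed to be
non-zero); they do not describe the actual KPI Repository.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    name: str
    target: float      # desired value, assumed non-zero
    tolerance: float   # accepted relative deviation, e.g. 0.10 for 10%


def check_kpis(kpis, measurements):
    """Return alert messages for KPIs that deviate beyond their tolerance."""
    alerts = []
    for kpi in kpis:
        actual = measurements.get(kpi.name)
        if actual is None:
            alerts.append(f"{kpi.name}: no measurement available")
            continue
        deviation = (actual - kpi.target) / kpi.target
        if abs(deviation) > kpi.tolerance:
            alerts.append(f"{kpi.name}: {deviation:+.0%} from target")
    return alerts


kpis = [Kpi("Sales Orders Closed", 100, 0.10), Kpi("Offers Sent", 50, 0.20)]
measured = {"Sales Orders Closed": 80, "Offers Sent": 55}
print(check_kpis(kpis, measured))
```

In a closed-loop setting the alerts themselves, together with the corrective
actions taken by the user, would be stored as an additional data source.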

As a next step, the implementation of BAM techniques should also be further
investigated, as this will automate even more the calibration of processes
according to performance.

In its implementation on SAP® Business One, KPIs are set for each one of the
business processes described above. Those KPIs are monitored with the use of
dashboards and the alerting system.

Analysis Layer: In the analysis layer, data mining techniques are used. Further
analysis of the exact techniques used is out of scope. What has to be mentioned
here regarding our design is that the components of this layer should be easily
available in the Query Engine of the User, so that he can apply them to the
KPIs of the system without technical knowledge.

Semantic Layer: A semantic layer is applied over the operation of the whole
system. The purpose of this addition is two-fold:

• Increase the capabilities of visualization and the development of new
combinations between the objects. This is the main reason for keeping a
Domain Ontology.

• Enable users to set up and “describe” the basic components of the system
(e.g. KPIs, Business Processes, Keywords etc.) without having to know a
programming language. This is the main reason for keeping a Business
Intelligence Ontology.

At the core of the system there is the Business Intelligence Ontology. This
describes all the concepts on which the BPMS is based, which are:

• Business Processes, e.g. “Sales Process”

• KPIs, e.g. “Sales Orders Closed”

• Domain Class, e.g. “Hotel”

A pseudo-definition of those concepts follows:

{Business Process}
    {has name}
    {has owner}
    {has trigger}

{Key Performance Indicator}
    {has name}
    {measures business process}
    {has database ind}
    {has visualization type}

{Domain Class}
    {has name}
    {relates with business process}
    {relates with domain class}
    {has database ind}

{Database Ind}
    {has name}
    {has database position}

In a few words, Database Ind is the connection point between the database and
the conceptual world of the domain. The whole visualization and query engine is
structured around the notions of Business Processes and Domain Classes. The
former are measured with KPIs. All those Ontologies are defined by the user for
the specific domain in which the system operates.
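
To make the pseudo-definition more concrete, it can be mirrored with plain
Python data classes. This is only an illustrative sketch: the field types, the
“table.column” format of the database position and the sample table names are
assumptions, not part of the framework or of the SAP® Business One schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatabaseInd:
    name: str
    database_position: str  # assumed "TABLE.COLUMN" format


@dataclass
class BusinessProcess:
    name: str
    owner: str
    trigger: str


@dataclass
class KeyPerformanceIndicator:
    name: str
    measures: BusinessProcess
    database_ind: DatabaseInd
    visualization_type: str


@dataclass
class DomainClass:
    name: str
    database_ind: DatabaseInd
    business_processes: List[BusinessProcess] = field(default_factory=list)
    related_classes: List["DomainClass"] = field(default_factory=list)


# Hypothetical instances a user might define for the travel domain.
sales = BusinessProcess("Sales Process", owner="Sales Department",
                        trigger="Customer request")
closed = KeyPerformanceIndicator(
    "Sales Orders Closed", sales,
    DatabaseInd("orders_closed", "SALES_ORDERS.STATUS"),
    visualization_type="dashboard gauge")
hotel = DomainClass("Hotel", DatabaseInd("hotel", "HOTELS.NAME"), [sales])
print(closed.measures.name, "->", closed.database_ind.database_position)
```

Only `DatabaseInd` refers to the physical database, so the other three
concepts remain independent of the underlying platform.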

View of Unstructured Data: The integration of unstructured data has some
limitations, as the analysis cannot fully simulate the human cognitive process.
Twitter messages, for instance, have a special syntax due to their character
limit. It is therefore wise to also show unstructured information on the User
Interface. This view is an informal, unprocessed stream of data, which the user
can use to “catch” information and keep himself updated.

The view presents information from the URLs in the Data Sources Repository. It
also uses the Keywords Repository in order to draw data using the Twitter API.

In its implementation on SAP® Business One, the Keywords Repository mainly
contains Hotels’ Names. The intuition is that the users should monitor possible
tweets that give real-time updates, opinions or comments on the hotels that
they sell to their customers.
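
The role of the Keywords Repository in this view can be sketched in Python
without calling any external API: messages that have already been retrieved are
simply matched against the stored keywords. The hotel names, the messages and
the function name are hypothetical.

```python
def matching_messages(messages, keywords):
    """Return (keyword, message) pairs for messages that mention a keyword."""
    hits = []
    for message in messages:
        lowered = message.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:  # case-insensitive substring match
                hits.append((keyword, message))
    return hits


keywords_repository = ["Hotel Alpha", "Beta Resort"]  # hypothetical hotel names
stream = [
    "Just checked in at Hotel Alpha, the view is amazing",
    "Traffic is terrible today",
    "Pool closed for repairs at beta resort this week",
]
for keyword, message in matching_messages(stream, keywords_repository):
    print(f"[{keyword}] {message}")
```

The messages are shown as they are, in line with the unprocessed character of
the view; no polarity is computed at this point.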

Visualization Layer: Apart from the Unstructured Stream of data, the
visualization layer comprises four more components:

• Alerts and Dashboards. Components of the BPMS, they monitor the KPIs in
order to visualize the performance of business processes to the user, so that
actions can be taken that affect the value chain of the company. They refer to
both the strategic and the operational level.

• Basic Reporting. It contains a number of preset reports that are developed
during the implementation phase in order to cover the initial reporting
requirements of the users. As they are preset, it is difficult to alter them or
create new ones with this method. It refers mainly to the strategic level.

• Query Engine. This mechanism aims at creating an environment through
which the user will be able to create his own queries on data stored in the
data warehouse. The intuition here is that the environment in which the user
works allows him to set the description of his domain, meaning the Business
Processes, the KPIs through which they are monitored, the keywords
repository etc. Virtually, the user is building the Domain Ontology on his
own. Using this inventory and this Domain Ontology, he can later pose
queries. This engine is based on the Business Intelligence Ontology, which is
global for all domains and exists at the core of the system. It refers mainly
to the strategic level.

4.4 DESCRIPTION OF OPINION MINING SUB-MODULE
4.4.1 Introduction
Opinion Mining is an important module of our proposal. The unstructured data
from the “outside world” are constantly taking a greater part in the
decision-making process, as was discussed in previous chapters. The framework
proposed takes advantage of those sources in two different ways:

• Show streams of data to Users through the User Interface. This visualization
does not aim at providing integrated, processed knowledge but only a quick view
of the most current news.

• Integrate data into the company’s Data Warehouse in order to use them further
in Business Intelligence Analysis. Those procedures are included in the Web-ETL
module and are the main content of Opinion Mining.

As Opinion Mining is a field of real interest for the coming years, a raw
prototype has been developed as an early stage of an operating solution. Its
current status is described below. The methods used are, in general, crude
compared to the latest developments described in Chapter 3, but they serve the
objective of the current thesis, which was to explore the specific field and
test the notions at a high level, identifying issues which should be solved in
a real product. The example below describes the application of the prototype to
a Travel Agency. However, the same prototype is designed to operate in other
domains too, by changing specific system parameters, namely:

• Sources

• Lexicons

The analysis output may also need modifications in order to fit the
requirements of the process it supports in each case.

The specific prototype concerns the implementation of Opinion Mining
techniques in the context of a SAP® Business One ERP implementation. The
logic as well as the repositories of the module (e.g. lexicons and their
properties) have been built in SQL, specifically in the MS-SQL environment, as
this is the DBMS used by the ERP software. Furthermore, as the specific
implementation was using BusinessObjects® software as its Business Intelligence
tool, the capabilities of this software are used in the visualization stage.
However, the aim of the present work is to develop a logic which is independent
of the platform that will be used for the visualization of the outputs.

(Note: in the following paragraphs there are, at some points, improvement plans
in italics that are scheduled for the next phases of the development.)

4.4.2 Business Case
The basic task of a Travel Agent (Salesman) is to offer his potential customers
Travel Package Offers that are attractive enough for them to buy. Those
packages, in most cases, are not ready-made but consist of custom services,
which means that they are shaped each time according to the parameters of each
case. A right decision on the package that will be offered is critical for the
Sales Department, as it increases the probability of the package being bought
as well as the potential customer’s satisfaction level and, thus, the
probability of this customer buying again.

The decision process is affected by:

• Customer needs. For example, a customer travelling with his children would
like a family hotel instead of one that is suitable for honeymooners.

• Quality of hotel services, in specific critical areas like rooms, location
etc.

• Offers that will make the package more attractive. For example, an offer of
free nights.

• Other conditions that may affect the decision. For example, an increase in the
dollar/euro rate may make rooms in hotels outside the Eurozone expensive.

During the Sales Process, the Agent uses data from the ERP database: a quality
rating that is set internally for each hotel, information like the facilities
that each hotel offers and the size of the rooms, as well as comments from the
customer survey conducted each time customers return from their travel. Data
about the prices and the offers of each hotel are also available to the user.

As valuable as the information held internally is, it is still limited. Travel
is a domain for which much content is created on the Internet concerning
travellers’ impressions and experiences. This represents active and updated
content from a much broader customer base, which can be integrated into the
decision process through Sentiment Analysis (as a Web-ETL process). Moreover,
data that do not have a direct link with the travel domain (e.g. social unrest
at a potential travel destination) but affect the decision can also be
presented to the user so that he can offer better propositions to the final
customer.

The structure, operation and details of the Sentiment Analysis module are
described below.

4.4.3 Description
The analysis is done at Feature Level, meaning that judgement is aligned with
specific features of the product. The analysis produces results in four
different areas of hotel services: Facilities, Room, Service, Location.
Domain-dependent lexicons have been manually developed for Features and
Opinions as well as for pairs between them. The lexicons are held in the form
of taxonomies, with each one of the four areas being connected through “is-a”
connections with recognized Feature Words. Each one of the Feature Words is in
turn connected with Opinion Words. Each Opinion Word holds two properties: Word
and Polarity. (Improvement Note: Gravity will be set as a third OW property. A
strong expression should have more gravity than a mild one.)
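
The shape of such a lexicon can be sketched in Python with two dictionaries.
The prototype itself stores its lexicons in MS-SQL tables, so this is only an
illustration of the structure; the words listed, the use of +1 and -1 as
polarity values and the function names are assumptions.

```python
# Area -> Feature Words ("is-a" links). Contents are illustrative only.
AREA_FEATURES = {
    "Facilities": ["pool", "gym"],
    "Room": ["bed", "bathroom"],
    "Service": ["staff", "reception"],
    "Location": ["beach", "centre"],
}

# Feature Word -> linked Opinion Words, each with its Polarity (+1 or -1).
FEATURE_OPINIONS = {
    "pool": {"clean": 1, "crowded": -1},
    "bed": {"comfortable": 1, "hard": -1},
    "staff": {"friendly": 1, "rude": -1},
    "beach": {"close": 1, "far": -1},
}


def area_of(feature_word):
    """Return the hotel-service area a Feature Word belongs to, if any."""
    for area, features in AREA_FEATURES.items():
        if feature_word in features:
            return area
    return None


def polarity_of(feature_word, opinion_word):
    """Polarity of a linked pair, or None when the pair is not in the lexicon."""
    return FEATURE_OPINIONS.get(feature_word, {}).get(opinion_word)


print(area_of("bed"), polarity_of("bed", "hard"))
```

Keeping the Opinion Words under each Feature Word means that only pairs known
to be linked can ever be scored, which is the restriction discussed in
Chapter 3 for words such as “cheap”.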

The process for the lexicon development is as follows. Initially, Feature Words are
identified for each of the four areas of hotel services. Their synonyms are searched for
and added to the repository. (Improvement Note: As this is currently done manually,
an integration with SentiWordNet is planned.) For each identified Feature Word,
Opinion Words are identified and connected to it. Such connections are identified by
scanning through a training corpus manually. (Improvement Note: Instead of manual
scanning, a supervised learning method is planned to be applied to a training corpus
for the development of the lexicon. Glosses Theory and integration with
SentiWordNet will further extend the lexicon.)

The analysis mainly uses syntactic heuristics to identify patterns that reveal
polarity, positive or negative, based on the two manually constructed lexicons. The
results are visualized and summarized for the four areas mentioned above.

Opinion Holder Identification is not considered critical because, in general, the texts
being analyzed are opinion texts in which the writer is also the holder of the opinion.

Subjectivity Analysis, although it has been shown to improve the precision of the
results when applied before the sentiment analysis, has not been used in the present
work.

Four successive phases take place as soon as the text is “delivered” to the
mechanism for processing, as shown in the figure below:



Figure 4.2: Opinion Mining Analysis Process

• Text pre-processing, where the collected corpus is divided into phrases and then
into words. A list of delimiters is used to identify phrases. “Phrase” is also a
property of each Word, used at a later stage to identify pairs of Opinion Words and
Feature Words. Words are the raw material for the analysis at the later stages.

Stemming of words is applied at this stage. This prototype uses brute-force methods,
in which candidate word forms are produced from the initial Lexicon (e.g. adding the
suffix -s to the end of each word in order to identify its plural form). (Improvement
Note: As this method is resource-intensive, at the next stage tools such as the Stanford
Parser will be used to identify and use the base form of words.)
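
The pre-processing phase can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the prototype's code: the delimiter list, the
function names and the tokenization rule are hypothetical, since the text does not
enumerate them.

```python
import re

# Assumed delimiter list; the actual list used by the prototype is not given.
PHRASE_DELIMITERS = ".!?;,"


def split_phrases(text):
    """Split raw text into phrases on any of the assumed delimiters."""
    pattern = "[" + re.escape(PHRASE_DELIMITERS) + "]"
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def tokenize(text):
    """Return (word, phrase_index) tuples, so each word keeps its phrase."""
    tokens = []
    for index, phrase in enumerate(split_phrases(text)):
        for word in re.findall(r"[a-z']+", phrase.lower()):
            tokens.append((word, index))
    return tokens


def expand_lexicon(words):
    """Brute-force candidate generation: map every generated surface form
    (here only a naive plural in -s) back to its lexicon entry."""
    forms = {}
    for word in words:
        forms[word] = word
        forms[word + "s"] = word
    return forms
```

For example, `expand_lexicon(["room"])` maps both `room` and `rooms` to the entry `room`.
Every lexicon word is expanded whether or not the generated form ever occurs in the
corpus, which is the resource cost the Improvement Note refers to.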

• Define Subjectivity. In this phase, the Feature Words and Opinion Words included
in the two lexicons are recognized in the corpus. As the same Opinion Word may have
a different polarity for different Feature Words, pairs must be identified. The
“phrase” property helps in identifying pairs. A simple intuition is used: an OW and
an FW co-occurring in the same phrase are considered a pair, provided the pair is
allowed by the lexicons developed. (Improvement Note: Extended Conjunction
Theory will be used to identify pairs of words and their polarity more effectively. This
is defined as the extensions made over the years to the initial Conjunction Theory
(Hatzivassiloglou, McKeown), which studies the polarity transfer between adjectives
on the two sides of a conjunction.)

As soon as the pairs are recognized, the polarity for each specific pair is drawn
from the database.

To tackle the coreference issue, the simple intuition used is that, for the text to
change the Feature Word to which it refers, the new FW must be mentioned explicitly.
Thus, Opinion Words without explicit links to an FW are connected implicitly with the
FWs of previous phrases (if those pairs are included in the lexicon).
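
The pairing rule and the coreference fallback described in this phase can be sketched as
below. The function name, the dictionary layout of the lexicon and the sample words are
assumptions for illustration; in the prototype the polarity of a pair is read from the
database.

```python
def find_pairs(phrases, lexicon):
    """Pair Opinion Words with Feature Words phrase by phrase.

    phrases: list of word lists, one list per phrase.
    lexicon: {feature_word: {opinion_word: polarity}} (assumed layout).
    Returns a list of (feature, opinion, polarity) triples.
    """
    pairs = []
    last_features = []  # Feature Words of the latest phrase that named any
    for words in phrases:
        features = [word for word in words if word in lexicon]
        if features:
            # An explicit mention changes the Feature Word under discussion.
            last_features = features
        # With no explicit mention, last_features still holds the Feature
        # Words of an earlier phrase (the coreference fallback).
        for word in words:
            for feature in last_features:
                polarity = lexicon[feature].get(word)
                if polarity is not None:  # pair must be allowed by the lexicon
                    pairs.append((feature, word, polarity))
    return pairs


# Invented sample data.
sample_lexicon = {"room": {"hot": -1, "clean": 1}, "water": {"hot": 1}}
sample_phrases = [
    ["the", "room", "was", "clean"],
    ["and", "hot"],                   # no Feature Word: refers back to "room"
    ["the", "water", "was", "hot"],
]
sample_pairs = find_pairs(sample_phrases, sample_lexicon)
```

In the sample, "hot" is paired with "room" in the second phrase and with "water" in the
third, so the same Opinion Word yields opposite polarities.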

• Post-moderators of subjectivity. In this phase, Negation Analysis is conducted.
The corpus is scanned for negation words, as listed in a dedicated index. If such a
word is found between an Opinion Word and a Feature Word, the polarity is reversed.



Improvement Note: Gradability Analysis is also planned. An adjective in superlative
form should signal a stronger opinion than one in regular form.
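
The negation rule of this phase, as described, amounts to the following check. The
content of the negation index and the function name are assumptions for this sketch.

```python
# Assumed contents of the negation index; the actual index is not listed.
NEGATION_WORDS = {"not", "no", "never", "hardly"}


def apply_negation(words, feature, opinion, polarity):
    """Reverse the polarity if a negation word lies between the Feature Word
    and the Opinion Word inside one phrase (first occurrences are used)."""
    if feature not in words or opinion not in words:
        return polarity  # e.g. a pair linked across phrases: left unchanged
    start, end = sorted((words.index(feature), words.index(opinion)))
    between = words[start + 1:end]
    if any(word in NEGATION_WORDS for word in between):
        return -polarity
    return polarity
```

For the phrase "the room was not clean", the pair (room, clean) with polarity +1 comes
out as -1. A negation word standing outside the span between the two words is ignored by
this rule, which is one of its known limits.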

• Visualization / Summarization. The output of the analysis is integrated into the
ERP database so that the Sales Agent can use it in the decision process. The
presentation of the information to the user is two-fold:

o Presentation on the SAP® Business One interface. The Sales Agent uses the
“Business Partner” form to choose a hotel. This form already contains information,
such as the hotel facilities, which helps the user decide on a selection. Moreover, the
same form shows the results of the Customer Satisfaction Survey from previous
travels.

Opinion Mining enriches this form by adding summarized results of the Web Sources
Analysis. These are given as an average evaluation on a 5-point scale for each one of
the four service areas. More data, such as the ranking of the hotel in relation to the
other hotels, can also be shown.

Figure 4.3: Opinion Mining enhancements on Basic Hotel SAP® Business One Form

o Presentation through a pre-designed BusinessObjects® report. A special report has
been developed through which the Sales Agent can view a summary of the results of
the analysis. The report presents:

▪ Average rating of the hotel for each service area

▪ Ranking of the hotel amongst all hotels for each service area, as this can be a
better indicator of the real level of service

▪ Star rating, which is set according to the ranking of the hotel

▪ Latest Comment, which shows the latest feeling expressed about the hotel

▪ Buzzwords: a list of the most frequently mentioned words

Figure 4.4: Sample of report produced with BusinessObjects® software
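
A possible way to compute the summary figures listed above is sketched below. The text
does not state how pair polarities are aggregated, so the linear mapping of a mean
polarity onto the 5-point scale is an assumption, as are all function names.

```python
from collections import Counter, defaultdict
from statistics import mean


def to_five_point(mean_polarity):
    """Assumed linear mapping of a mean polarity in [-1, 1] onto [1, 5]."""
    return round(3 + 2 * mean_polarity, 2)


def area_ratings(pairs):
    """pairs: iterable of (hotel, area, polarity) -> {hotel: {area: rating}}."""
    grouped = defaultdict(list)
    for hotel, area, polarity in pairs:
        grouped[(hotel, area)].append(polarity)
    ratings = defaultdict(dict)
    for (hotel, area), values in grouped.items():
        ratings[hotel][area] = to_five_point(mean(values))
    return dict(ratings)


def rank_hotels(ratings, area):
    """Hotels ordered from best to worst rating for one service area."""
    scored = [(hotel, by_area[area])
              for hotel, by_area in ratings.items() if area in by_area]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [hotel for hotel, _ in scored]


def buzzwords(words, top=3):
    """The most frequently mentioned words."""
    return [word for word, _ in Counter(words).most_common(top)]
```

Ranking is computed per service area because, as noted in the list above, the position of
a hotel among all hotels can say more about its real level of service than the average
rating alone.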



The database schema of the Opinion Mining module is shown in Annex I.

4.5 CONCLUSIONS AND FUTURE WORK

The development so far has produced a raw prototype in which specific methods are
applied to conduct the analysis. The modules of the system described above are either
covered roughly by the current ERP platform or are still described at a conceptual
level (mainly as regards the Semantic Layer). The scope of the future work is to
integrate all modules into a unified platform, independent of any other underlying
platform the company uses.
Regarding the Opinion Mining sub-module, the objective was to show that
unstructured data from Web sources can be integrated into the Data Warehouse of the
company and enhance the decision process. This appears to have been achieved, as the
products of Sentiment Analysis affect the Sales Process of the company.

[Figure: development stages. Current State: Raw Prototype; First Product; Second Product: Semantic Layer]

Figure 4.5: Current State and future development of Sentiment Analysis Module

However, the current state of the module cannot support production operations, as it
is difficult to scale. Due to time constraints, some aspects have not yet reached the
planned status. Lexicons are developed and maintained manually. Stemming is done
using brute-force techniques that are resource-intensive. Some phenomena, such as
multi-word phrasing, are not covered, and subjectivity analysis has not been
implemented, so its effect on the results of the analysis could not be measured. Those
steps are planned for the next stage, in which a First Product will be developed. At
that stage the module should be able to handle common syntactic heuristics and to
scale easily enough to operate in real production environments.
At the stage of the Second Product, a semantic layer is planned to be applied. With
this step, the product will be able to recognize more relationships between the objects
analyzed. This will have a clear effect on the analysis and visualization stage, where
more complex matrices and networks will be presented to the user. Moreover, at this
stage, and with the use of the semantic layer, events that have no direct relationship
with the domain (social, financial, etc.) will be included in the analysis.
In the figure below, the functionalities of a model Opinion Mining tool are shown
next to the current status of each of them.

Figure 4.6: Current State and next steps of Sentiment Analysis Module, Functions View

CHAPTER 5: CONCLUSIONS

5.1 CONCLUSIONS
Business Intelligence will continue to evolve into a more important part of business
operations, as more data from more sources become available at lower cost.
Moreover, its integration with Business Process Management theories makes it more
valuable to the core business, as it interacts directly with the Value Chain of a
company.
The framework discussed, although still mainly at a theoretical level, presents a
solution that meets the need for up-to-date data from various sources, integrated in
the same Data Warehouse and properly tagged using an efficient metadata repository.
Those data are designed so that users can query them, through a friendly
environment, in order to produce results. Moreover, a semantic layer is added in order
to extend the types of connections that can be established and, thus, the analysis that
can take place. All this uses a BI Ontology as the core of the system, in order to make
the system easy to extend and to switch from domain to domain.
The future work consists of developing those modules in order to test the ability of
this integrated system to work in a production environment and to offer what it is
meant to offer: well-organized information in a friendly environment in which users,
processes and data co-operate effectively for the result.

ANNEX I: OPINION MINING DB SCHEMA

