
MAHARAJA AGRASEN INSTITUTE

OF TECHNOLOGY

MINOR PROGRESS
REPORT
Business Intelligence Tool on Hadoop
Guided by:
Mr. Sandeep Tayal

Submitted by:
Himanshi Mehra(04414802712)
Labhansh Atriwal(01914802712)
Parth Nagar(01114802712)

CERTIFICATE AND DECLARATION


This is to certify that the minor project entitled "A BUSINESS INTELLIGENCE TOOL ON
HADOOP FRAMEWORK", being submitted by Ms Himanshi Mehra, Mr Labhansh
Atriwal and Mr Parth Nagar in partial fulfilment of the requirement for the award of the degree
of B.Tech - CSE of Maharaja Agrasen Institute of Technology, GGSIPU, is a record of
authentic work carried out by them under the guidance of Mr. Sandeep Tayal.

Place : New Delhi


Date : 17th September, 2015

SIGNATURE OF GUIDE :

(Mr Sandeep Tayal)

ABSTRACT

Big data can mean big opportunities for organizations, but only if they can make sense
of the data in a reasonable amount of time. When we are talking about terabytes and petabytes
of information, often unstructured, generated by social networking, sensors, financial
transactions, mobile applications and so much more, this is no small task. The open source
Hadoop ecosystem of tools and technologies can help companies tackle the broad problem of
big data analytics.

Enter Business Intelligence (BI), a concept that has been around for decades but has
seldom been applied to its full potential on unstructured content. BI allows for the easy
interpretation of these large volumes of data. Identifying new opportunities and implementing
an effective strategy based on these insights can provide businesses with a competitive market
advantage and long-term stability, and can drive business decisions.

Thus, the project aims to create a business intelligence tool that works on
unstructured data stored in the Hadoop Distributed File System (HDFS).

INTRODUCTION
Corporate Performance Management, Customer Relations, Sales Performance and Customer
Churn Detection are some of the many applications that can benefit from a Business
Intelligence tool.

The key general categories of business intelligence tools are:

Spreadsheets.
Reporting and querying software: tools that extract, sort, summarize, and present selected data.
OLAP: Online analytical processing.
Digital dashboards.
Data mining.
Data warehousing.
Local information systems.

The rapid proliferation of unstructured data is one of the driving forces of the new paradigm of
big data analytics.

According to one study, we are now producing as much data every 10 minutes as was
created from the beginning of recorded time through the year 2003.
The preponderance of data being created is of the unstructured variety -- up to about 90%,
according to IDC [1].
The term Unstructured Business Intelligence describes methods and tools that enable data
warehouse applications to use unstructured information (i.e. information that does not have a
pre-defined data model or is not organized in a pre-defined manner).
With unstructured data reaching terabyte or even petabyte scale, there is a need to
process this data in a way that is horizontally scalable. This is where the Hadoop Distributed
File System comes into play.
Hadoop's main components are its file system (HDFS), which provides cheap and reliable data
storage, and its MapReduce engine, which provides high-performance parallel data processing.
In addition, Hadoop is self-managing and can easily handle hardware failures as well as scaling
up or down its deployment without any change to the codebase. Hadoop installs on low-cost
commodity servers, reducing the deployment cost.
Thus, we aim to build a text-based Business Intelligence tool that works on unstructured
data. To enhance the performance of processing Big Data, the Hadoop Distributed File System
(HDFS) and the MapReduce function are used.

LITERATURE REVIEW

WHAT IS BIG DATA?


Lately, it's getting hard to put enough zeros on numbers that quantify the volume of
data our wired world generates. Current research estimates that our Facebook likes,
Instagram photos, YouTube videos and blog entries contribute to some 2.5 billion gigabytes of
data generated every 24 hours.

Much of the daily torrent of newly minted information is unseen. In addition to tweets,
pics and status updates, a deluge of data generated by RFID readers, sensor networks, logs and
countless other auto-reporting systems fills vast data pools.

"Sensors on a single commercial aircraft produce 10 terabytes of data every 30 minutes
it is operating," says Alex Black, senior partner in Analytic Insights and Information
Management (AIIM) at CSC. Multiply that by all the aircraft in service around the world at
any moment, and that's a big number.

That's Big Data. How that data is collected, curated and harnessed is a topic that once
interested only computer science researchers. Black describes four characteristics unique to big
data: volume, velocity, variety and veracity [5]. With equipment sensors pumping out petabytes
of data every hour, the volume and velocity of Big Data are easy to appreciate.

UNSTRUCTURED DATA
Unstructured Data (or unstructured information) refers to information that either does
not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured
information is typically text-heavy, but may contain data such as dates, numbers, and facts as
well. This results in irregularities and ambiguities that make it difficult to understand using
traditional programs as compared to data stored in fielded form in databases or annotated
(semantically tagged) in documents.

In 1998, Merrill Lynch [4] cited a rule of thumb that somewhere around 80-90% of all
potentially usable business information may originate in unstructured form. This rule of thumb
is not based on primary or any quantitative research, but nonetheless is accepted by some.

IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold
growth from the beginning of 2010. Computer World states that unstructured information
might account for more than 70%-80% of all data in organizations.

WHAT IS BUSINESS INTELLIGENCE?


Business intelligence (BI) is an umbrella term used to encompass the processes,
methods, measurements and systems businesses use to more easily view, analyse and
understand information relevant to the history, current performance or future projections for a
business [3]. Other terms that people often use to describe BI include business analytics,
decision support and executive decision support.

The goal of BI is to help decision-makers make more informed and better decisions to
guide the business. Business intelligence software and software-as-a-service (SaaS) solutions
accomplish this by making it simpler to aggregate, see, and slice-and-dice the data. In turn, this
makes it easier to identify trends and issues, uncover new insights, and fine-tune operations to
meet business goals.

BI solutions can be very comprehensive, or they can focus on specific functions, such
as corporate performance management, spend analysis, sales pipeline analysis and sales
compensation analysis.

WHY SHOULD YOU CARE?


Results from the SMB Group's recently released survey, the 2010 SMB Routes to Market
Study [2], reveal that SMBs view getting better insights out of the data they already have as their
top technology challenge. BI solutions can solve this problem by providing a framework and
tools to measure and manage business goals and conduct what-if scenarios to evaluate
different courses of action.

In very small companies, spreadsheets and other ad hoc tools are often enough to get
the job done. But as companies grow, the amount of data decision makers need to understand
grows: new products and services, new markets and opportunities, investments in operations,
sales, marketing and other systems to support growth.

As a result, more people have to be part of the data collection and analysis process, and
different people in the organization (sales, marketing, finance, etc.) need to look at data in
different ways. Typical problems with the spreadsheet approach include:

Time consuming and labour intensive to set up and maintain. Establishing a
company-wide model, creating organizational plans, distributing and collecting
information from different managers, consolidating multiple spreadsheets, and
debugging broken macros and formulas becomes unwieldy.
Insufficient collaboration and feedback capabilities. Desktop spreadsheets are
siloed, and don't enable real-time data sharing and updating. Getting a unified, accurate
view becomes difficult.
Error prone. Research shows that 20 to 40 percent of all spreadsheets contain errors,
and as they become more complex, error rates multiply. Without an audit trail, changes
and mistakes can go undetected and businesses make decisions based on bad
information.
Inadequate analysis and reporting. Collecting information and cobbling it together
via spreadsheets is cumbersome. The detailed information that decision-makers need
can be hard to get or not even available.

Business intelligence solutions give businesses a way to streamline and unify the data
collection, analysis and reporting process. BI solutions are built on a unified database, so
everyone involved in the process gets a single, real-time view of the data. Many BI solutions
feature self-service dashboards and reporting tools that make it easier and less time consuming
to contribute to and manage the process.

WHAT TO CONSIDER? TOOLS IN THE MARKET

Until recently, BI solutions have typically been too expensive and complicated for many
SMBs to use and manage. But more recently, vendors have made strides to make BI solutions
more tailored, accessible and affordable. For example:

Function-specific BI solutions. Many vendors have introduced software designed to
focus on the analytical needs specific to a particular department or process. By focusing
on a specific need, they can offer solutions that are simpler to use and more
cost-effective. For example, vendors such as Adaptive Planning and Host Analytics focus
exclusively on corporate performance management; Cloud9 Analytics concentrates on
helping companies manage sales performance; Xactly focuses on sales compensation
analytics; and Rosslyn Analytics addresses spend management and analysis.
Pre-packaged solutions within a broader BI suite. Companies that offer broad,
comprehensive suites that include BI, data warehousing and analytics capabilities have
been re-packaging their solutions to focus on specific needs. For instance, SAP
Business Objects Edge offers modules for planning and consolidation and for strategy
management and scorecarding; and Birst offers pre-packaged solutions for sales,
marketing and financials.
ERP and CRM companies providing pre-integrated BI solutions. Many ERP and
CRM vendors now offer pre-integration with BI solutions to reduce the time, difficulty
and expense of deploying BI to work with an existing system. Examples include
NetSuite, which partners with Adaptive Planning and MyDials, and Salesforce.com, which
partners with Xactly for sales compensation management.
On demand, software-as-a-service (SaaS) BI solutions. The SaaS model removes IT
infrastructure costs from the BI equation, and it can dramatically reduce or eliminate
upfront capital costs. Many of the vendors mentioned above -- and others -- deliver
their BI solutions via a SaaS model.

Today, there are more BI choices geared for SMB needs and budgets than ever.
However, vendors characterize and target the SMB market differently, and these differences
are reflected in pricing, solution capabilities and complexity. Start with a thorough assessment
of your internal needs, and then carefully investigate and evaluate how different offerings map
to your organizational requirements and constraints.

TEXT IN BUSINESS INTELLIGENCE TOOL

Using text data for Business Intelligence comprises three steps:


1. Text analysis components that extract information with sufficiently high quality to be
used by automatic downstream processing.
2. Approaches to create efficient relational schemas for text analysis results that enable
both reporting and data mining.
3. Analytics components that can combine structured and previously unstructured data to
yield better results and more insight.

OBJECTIVE AND SCOPE

The project aims to:

Create a Business Intelligence tool that works on unstructured data.

Collect a large amount of data using a web data scraper and store it in the
Hadoop Distributed File System (HDFS).

Run the BI tool on a network of computers, so that the result is produced
using the processing power of multiple computers.

Distribute the data to the processors using the MapReduce function.

Include statistical analysis, geographical and timeline representation, and
sentiment analysis in the tool.

PROPOSED APPLICATIONS

Improve Quality Early Warning: Internal problem reports, customer email or call
centre transcripts can yield valid information about emerging product problems. Today,
companies try to capture these insights using a fixed set of categories within problem
taxonomies. Such taxonomies typically suffer from granularity problems: if they
contain only high-level categories, they can't capture the actual reason for a problem.
However, if they try to capture all possible problems, they become too unwieldy to use
for front-line workers, who just stick to the categories they know (especially in a
high-stress environment such as a call centre). Thus, the actual reason for a defect is often
buried within technician comments or call centre logs. As a result, a company may
detect that there is a problem with a certain product, but doesn't know which part causes
the problem, and therefore can't take the right action: deciding on a product recall, or
checking other products that use the offending part.

Reduce customer churn: Companies in the telecommunication sector already have
elaborate predictive analytic models for customer churn, based on structured data.
However, once the customer's unhappiness with a certain service shows up in the
structured data (for example, a decline in the number of long-distance calls made), it
may already be too late. By analysing each customer contact with the company, be it
email or call centre records, a company can detect angry or unhappy customers, or
customers who explicitly reference a competitor, earlier and feed that into its churn
model. This allows action to be taken at the first sign of customer discontent.

Reputation management: Blogs, news articles and consumer portals increasingly
affect customers' buying decisions, especially for consumer goods. Analysing this
extranet content helps to answer questions like "How do people talk about my company
and my products compared to my competition?" or "What companies are associated
with cool technology?" (for example, hybrid technology in the automotive domain).
This can serve as an External Quality Early Warning system, since not every unhappy
customer bothers to write to the company, yet their forum entries may turn away quite
a few prospective customers.


METHODOLOGY

Step 1: Data Extraction using a Data Scraper


A web data scraper will be created to collect data from the internet.
The data can be any unstructured data, such as emails, text, dates, addresses, numbers, images
and audio.
The scraper takes a domain as input and scrapes data from every page of the website.
It considers the following mark-up tags: <p></p> (paragraph), <td></td> (table cell),
<div></div> (division), <a></a> (hyperlink) and <b></b> (bold). It recursively scans every page
on the domain and extracts any text under these tags. The extracted data is stored in text files.
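
The tag-based extraction described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (TagTextExtractor, extractText) are hypothetical, the input is an HTML string already in memory, and a regular expression stands in for a proper HTML parser, so nested elements of the same tag name are not handled. Crawling every page of a domain and writing the text files are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not taken from the project code.
class TagTextExtractor {
    // Matches <p>, <td>, <div>, <a> or <b> elements and captures their inner markup.
    // Assumption: a regex is enough for a sketch; a real scraper would use an HTML parser.
    private static final Pattern ELEMENT = Pattern.compile(
            "<(p|td|div|a|b)\\b[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Returns the text found under each matched element, in document order.
    static List<String> extractText(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = ELEMENT.matcher(html);
        while (m.find()) {
            // Drop markup nested inside the element and normalise whitespace.
            String text = ANY_TAG.matcher(m.group(2)).replaceAll(" ")
                    .replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Great <b>battery</b> life</p>"
                + "<table><tr><td>Price: 200</td></tr></table></body></html>";
        extractText(page).forEach(System.out::println);
    }
}
```

For the sample page in main, the sketch yields two text fragments, "Great battery life" and "Price: 200"; tags outside the list, such as <body> or <table>, are ignored.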

Step 2: Creating a Business Intelligence Tool


A Business Intelligence tool will be created in Java to analyse unstructured data and gain
useful insight. The following components will be incorporated in it:

Cluster Analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense or another) to each other
than to those in other groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields.
In the case of the BI tool, the words in the extracted documents and their frequencies are
analysed. The most frequent words are used for clustering. The attributes linked to these words
are used to find similarities among words and cluster them.
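
The word-frequency step that feeds the clustering could look roughly like the sketch below. The names (WordFrequency, count, topWords) and the tokenisation rule (lower-casing and splitting on anything that is not a letter or digit) are assumptions made for illustration. The clustering itself is not shown, because the attributes and similarity measure are not specified here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper; names and tokenisation rule are assumptions.
class WordFrequency {
    // Counts how often each word occurs across all documents.
    static Map<String, Integer> count(List<String> documents) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase().split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns the n most frequent words; ties are broken alphabetically.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(List.of(
                "Battery life is good", "good battery, bad screen"));
        System.out.println(topWords(counts, 2));
    }
}
```

In practice a stop-word filter would probably be needed before ranking, since words like "is" or "the" tend to dominate raw frequency counts.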

Sentiment Analysis
Sentiment analysis (also known as opinion mining) refers to the use of natural language
processing, text analysis and computational linguistics to identify and extract subjective
information in source materials. Sentiment analysis is widely applied to reviews and social
media for a variety of applications, ranging from marketing to customer service.
Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer
with respect to some topic or the overall contextual polarity of a document. The attitude may
be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the
emotional state of the author when writing), or the intended emotional communication (that is
to say, the emotional effect the author wishes to have on the reader).
A basic task in sentiment analysis is classifying the polarity of a given text at the document,
sentence, or feature/aspect level: whether the expressed opinion in a document, a sentence
or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity"
sentiment classification looks, for instance, at emotional states such as "angry," "sad," and
"happy."
In the BI tool, sentiment analysis is used to determine the sentiment of customer reviews or
opinions. This is done by scanning every sentence from the end for sentiment-bearing words such
as "good". Every positive word increments the overall sentiment score, while every negative word
decrements it. An overall value less than zero means negative sentiment, and a value greater
than zero means positive sentiment.
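
The word-counting scheme above can be sketched in a few lines of Java. The class name, the tiny word lists and the treatment of a zero score as "neutral" are assumptions made for this illustration; the project's actual lexicon is not given here.

```java
import java.util.Set;

// Hypothetical sketch; the lexicons below are illustrative, not the project's word lists.
class SentimentSketch {
    private static final Set<String> POSITIVE = Set.of("good", "great", "excellent", "happy");
    private static final Set<String> NEGATIVE = Set.of("bad", "poor", "terrible", "angry");

    // Adds 1 for every positive word and subtracts 1 for every negative word.
    static int score(String text) {
        String[] tokens = text.toLowerCase().split("[^a-z]+");
        int score = 0;
        // Walk the tokens from the end, as described above.
        // For a plain sum the scan direction does not change the result.
        for (int i = tokens.length - 1; i >= 0; i--) {
            if (POSITIVE.contains(tokens[i])) {
                score++;
            } else if (NEGATIVE.contains(tokens[i])) {
                score--;
            }
        }
        return score;
    }

    // Assumption: a score of exactly zero is reported as neutral.
    static String label(int score) {
        if (score > 0) return "positive";
        if (score < 0) return "negative";
        return "neutral";
    }

    public static void main(String[] args) {
        String review = "Good screen but terrible battery and poor support";
        System.out.println(label(score(review)));
    }
}
```

The sample review contains one positive and two negative words, so its score is -1 and it is labelled negative. A purely lexical count like this ignores negation ("not good"), which is one reason it should be read as a starting point only.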

Geographical and Timeline representation


This technique maps documents or texts to their geographic location and time.
This analysis adds demographic context to the BI tool's results.
Geotagging on Google Maps is an effective visual representation.

Step 3: Storing data in HDFS and processing data using MapReduce function

Data will be fed into the Hadoop Distributed File System in raw form. Every commodity
machine in the cluster will have this tool installed on it.
While processing the large data set, the Hadoop MapReduce function is used to distribute the
load across parallel resources.
The outputs of all parallel resources are combined at the end.
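
To make the map, shuffle and reduce stages concrete, the sketch below imitates them in plain Java on a word-count task. It deliberately does not use the Hadoop API: the class MapReduceSketch and its methods are hypothetical, a parallel stream stands in for the cluster's parallel workers, and each input string stands in for one input split read from HDFS.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java imitation of the MapReduce data flow; not Hadoop code.
class MapReduceSketch {
    // Map phase: one input split (here, one line of text) becomes (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String split) {
        return Stream.of(split.toLowerCase().split("[^a-z0-9]+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1));
    }

    // Shuffle and reduce: group the pairs by key, then sum the values of each group.
    // The partial results of the parallel workers are merged into one map at the end.
    static Map<String, Integer> run(List<String> splits) {
        return splits.parallelStream()
                .flatMap(MapReduceSketch::map)
                .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("big data big insight", "data on Hadoop")));
    }
}
```

In a real Hadoop job the same two roles would be played by a mapper and a reducer class submitted to the cluster, with HDFS supplying the input splits; the data flow is the same, only the execution is distributed across machines.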

PROGRESS

Work done to date towards completion of the project is stated below:
1. The technology stack for the topic was prepared and studied thoroughly.
2. The data scraper has been built and data extraction is in progress.

Given below is a progress chart representing the plan of action:

[Progress chart: 25% complete, 75% remaining]


REFERENCES

1. "Extracting Value from Chaos", IDC, June 2011.
2. Laurie McCabe, "What is Business Intelligence, and Why Should You Care?"
3. Russom, P.: BI Search and Text Analytics, TDWI Best Practices Report, 2007; pp. 9-11
4. Christopher C. Shilakes and Julie Tylman, "Enterprise Information Portals", Merrill
Lynch, 16 November 1998.
5. Beyer, Mark. "Gartner Says Solving 'Big Data' Challenge Involves More Than Just
Managing Volumes of Data". Gartner. Archived from the original on 10 July 2011.
Retrieved 13 July 2011.
