OF TECHNOLOGY
MINOR PROGRESS
REPORT
Business Intelligence Tool on Hadoop
Guided by:
Mr. Sandeep Tayal
Submitted by:
Himanshi Mehra (04414802712)
Labhansh Atriwal (01914802712)
Parth Nagar (01114802712)
SIGNATURE OF GUIDE :
ABSTRACT
Big data can mean big opportunities for organizations, but only if they can make sense
of the data in a reasonable amount of time. When we are talking about terabytes and petabytes
of information, often unstructured, generated by social networking, sensors, financial
transactions, mobile applications and so much more, this is no small task. The open source
Hadoop ecosystem of tools and technologies can help companies tackle the broad problem of
big data analytics.
Enter Business Intelligence (BI), a concept that has been around for decades but has
seldom been applied to unstructured content. BI allows for the easy interpretation of
large volumes of data. Identifying new opportunities and implementing an effective
strategy based on these insights can give businesses a competitive market advantage and
long-term stability, and can drive business decisions.
Thus, the project aims to create a business intelligence tool that works on
unstructured data stored in the Hadoop Distributed File System (HDFS).
INTRODUCTION
Corporate performance management, customer relations, sales performance and customer
churn detection are some of the many applications that can benefit from a Business
Intelligence tool. Common kinds of BI tools include:
Spreadsheets.
Reporting and querying software: tools that extract, sort, summarize, and present selected data.
OLAP: Online analytical processing.
Digital dashboards.
Data mining.
Data warehousing.
Local information systems.
The rapid proliferation of unstructured data is one of the driving forces of the new paradigm of
big data analytics.
According to one study, we are now producing as much data every 10 minutes as was
created from the beginning of recorded time through the year 2003.
The preponderance of data being created is of the unstructured variety -- up to about 90%,
according to IDC [1].
The term Unstructured Business Intelligence describes methods and tools that enable data
warehouse applications to use unstructured information (i.e. information that does not have a
pre-defined data model or is not organized in a pre-defined manner).
With unstructured data reaching terabyte (or petabyte) scale, there is a need to
process this data in a way that is horizontally scalable. This is where the Hadoop Distributed
File System comes into play.
Hadoop's main components are its file system (HDFS), which provides cheap and reliable data
storage, and its MapReduce engine, which provides high-performance parallel data processing.
In addition, Hadoop is self-managing: it can handle hardware failures and scale its
deployment up or down without any change to the codebase. Hadoop installs on low-cost
commodity servers, which reduces the deployment cost.
Thus, we aim to build a text-based Business Intelligence tool that works on unstructured
data. To improve the performance of Big Data processing, the tool uses the Hadoop
Distributed File System (HDFS) for storage and MapReduce for computation.
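To make the MapReduce model concrete, the following is a minimal single-process sketch of its three steps (map, shuffle, reduce) applied to word counting. It is an illustration only, not the project's code: the function names are hypothetical, and a real Hadoop job would run the map and reduce steps in parallel across the cluster instead of in one Python process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data needs big tools", "hadoop stores big data"]
    print(word_count(docs))
```

Because each document is mapped independently and each word is reduced independently, both steps can be spread over many machines, which is what makes the approach horizontally scalable.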
LITERATURE REVIEW
Much of the daily torrent of newly minted information is unseen. In addition to tweets,
pics and status updates, a deluge of data generated by RFID readers, sensor networks, logs and
countless other auto-reporting systems fills vast data pools.
That's Big Data. How that data is collected, curated and harnessed is a topic that once
interested only computer science researchers. Black describes four characteristics unique to big
data: volume, velocity, variety and veracity [5]. With equipment sensors pumping out petabytes
of data every hour, the volume and velocity of Big Data are easy to appreciate.
UNSTRUCTURED DATA
Unstructured Data (or unstructured information) refers to information that either does
not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured
information is typically text-heavy, but may contain data such as dates, numbers, and facts as
well. This results in irregularities and ambiguities that make it difficult to understand using
traditional programs as compared to data stored in fielded form in databases or annotated
(semantically tagged) in documents.
In 1998, Merrill Lynch [4] cited a rule of thumb that somewhere around 80-90% of all
potentially usable business information may originate in unstructured form. This rule of thumb
is not based on primary or any quantitative research, but nonetheless is accepted by some.
IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold
growth from the beginning of 2010. Computer World states that unstructured information
might account for more than 70%-80% of all data in organizations.
The goal of BI is to help decision-makers make more informed and better decisions to
guide the business. Business intelligence software and software-as-a-service (SaaS) solutions
accomplish this by making it simpler to aggregate, see, and slice-and-dice the data. In turn, this
makes it easier to identify trends and issues, uncover new insights, and fine-tune operations to
meet business goals.
BI solutions can be very comprehensive, or they can focus on specific functions, such
as corporate performance management, spend analysis, sales pipeline analysis and sales
compensation analysis.
In very small companies, spreadsheets and other ad hoc tools are often enough to get
the job done. But as companies grow, the amount of data decision makers need to understand
grows: new products and services, new markets and opportunities, investments in operations,
sales, marketing and other systems to support growth.
As a result, more people have to be part of the data collection and analysis process, and
different people in the organization (sales, marketing, finance, etc.) need to look at data in
different ways. Typical problems with the spreadsheet approach include:
Business intelligence solutions give businesses a way to streamline and unify the data
collection, analysis and reporting process. BI solutions are built on a unified database, so
everyone involved in the process gets a single, real-time view of the data. Many BI solutions
feature self-service dashboards and reporting tools that make it easier and less time consuming
to contribute to and manage the process.
Until recently, BI solutions have typically been too expensive and complicated for many
SMBs to use and manage. But more recently, vendors have made strides to make BI solutions
more tailored, accessible and affordable. For example:
Today, there are more BI choices geared for SMB needs and budgets than ever.
However, vendors characterize and target the SMB market differently, and these differences
are reflected in pricing, solution capabilities and complexity. Start with a thorough assessment
of your internal needs, and then carefully investigate and evaluate how different offerings map
to your organizational requirements and constraints.
A large amount of data will be collected using a web data scraper and stored in the
Hadoop Distributed File System (HDFS).
The BI tool will run on a network of computers, wherein the result will be produced
using the processing power of multiple computers.
The tool will include statistical analysis, geographical and timeline representation, and
sentiment analysis.
PROPOSED APPLICATIONS
Improve Quality Early Warning: Internal problem reports, customer email or call
centre transcripts can yield valid information about emerging product problems. Today,
companies try to capture these insights using a fixed set of categories within problem
taxonomies. Such taxonomies typically suffer from granularity problems: if they
contain only high-level categories, they can't capture the actual reason for a problem.
However, if they try to capture all possible problems, they become too unwieldy for
front-line workers, who just stick to the categories they know (especially in a
high-stress environment such as a call centre). Thus, the actual reason for a defect is often
buried within technician comments or call centre logs. As a result, a company may
detect that there is a problem with a certain product, but not know which part causes
the problem, and therefore can't take the right action: deciding on a product recall, or
checking other products that use the offending part.
Detect Customer Churn: By analysing email or call centre records, a company can detect
angry or unhappy customers, or customers that explicitly reference a competitor, at an
earlier stage and include that information in its churn model. This allows for taking
action at the first sign of customer discontent.
METHODOLOGY
Cluster Analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense or another) to each other
than to those in other groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields.
In the BI tool, the words in the extracted documents and their frequencies are analysed.
The most frequent words are used for clustering. The attributes linked to these words are used
to find similarities among the words and cluster them.
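The report does not specify which attributes or which clustering algorithm are used, so the following is only a sketch under stated assumptions: each word's attribute vector is taken to be its count in every document, similarity is cosine similarity, and clustering is a simple greedy pass with a fixed threshold. All names and parameter values here are hypothetical.

```python
import math
from collections import Counter


def top_words(documents, n):
    """Return the n most frequent words across all documents."""
    totals = Counter(word for doc in documents for word in doc.lower().split())
    return [word for word, _ in totals.most_common(n)]


def word_vector(word, documents):
    """Assumed attribute vector of a word: its count in each document."""
    return [doc.lower().split().count(word) for doc in documents]


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(documents, n=6, threshold=0.8):
    """Greedy clustering: a word joins the first cluster whose seed word is
    similar enough to it; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of words; its first word is the seed
    for word in top_words(documents, n):
        vec = word_vector(word, documents)
        for cluster in clusters:
            if cosine(vec, word_vector(cluster[0], documents)) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


if __name__ == "__main__":
    docs = [
        "battery drains fast",
        "battery drains overnight",
        "screen cracked easily",
        "screen cracked again",
    ]
    print(cluster_words(docs, n=4))
```

Words that tend to appear in the same documents end up in the same cluster. A production tool would more likely use an established algorithm such as k-means over richer features.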
Sentiment Analysis
Sentiment analysis (also known as opinion mining) refers to the use of natural language
processing, text analysis and computational linguistics to identify and extract subjective
information in source materials. Sentiment analysis is widely applied to reviews and social
media for a variety of applications, ranging from marketing to customer service.
Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer
with respect to some topic or the overall contextual polarity of a document. The attitude may
be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the
emotional state of the author when writing), or the intended emotional communication (that is
to say, the emotional effect the author wishes to have on the reader).
A basic task in sentiment analysis is classifying the polarity of a given text at the document,
sentence, or feature/aspect level: whether the expressed opinion in a document, a sentence
or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity"
sentiment classification looks, for instance, at emotional states such as "angry," "sad," and
"happy."
In the BI tool, sentiment analysis is used to determine the polarity of customer reviews or
opinions. This is done by scanning every sentence from the end for sentiment-bearing words
such as "good". Every positive word increments the overall sentiment score, while every
negative word decrements it. An overall value less than zero means negative sentiment, and a
value greater than zero means positive sentiment.
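The scoring rule described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the two word lists are tiny hypothetical lexicons, and treating a score of exactly zero as neutral is an assumption the report does not state.

```python
import re

# Hypothetical lexicons; a real tool would load much larger word lists.
POSITIVE = {"good", "great", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "angry"}


def sentence_score(sentence):
    """Scan one sentence from its last word to its first and sum the polarity."""
    score = 0
    for word in reversed(sentence.lower().split()):
        token = word.strip(".,!?;:\"'")  # drop surrounding punctuation
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score


def overall_sentiment(text):
    """Classify a text by the sign of the summed sentence scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = sum(sentence_score(s) for s in sentences)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"  # assumption: a score of zero is treated as neutral


if __name__ == "__main__":
    review = "The screen is good. Battery life is terrible and support was bad!"
    print(overall_sentiment(review))
```

A pure word-count approach like this ignores negation ("not good") and sarcasm, so it is best read as a baseline.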
PROGRESS
The work completed to date is listed below:
1. The technology stack for the project was prepared and studied thoroughly.
2. The data scraper has been built, and data extraction is in progress.
Overall progress: 25% complete, 75% remaining.
REFERENCES