
Seminar on
BIG DATA MINING: A CHALLENGE AND HOW TO MANAGE IT

Submitted to:
Submitted by: Dinesh and Jitender
INTRODUCTION
Big Data is a term used to identify datasets that, due to their large size and complexity, cannot be managed with our current methodologies or data mining software tools.
Big Data mining is the capability of extracting useful information from these large datasets or streams of data, which was not possible before due to their volume, variability, and velocity.
The Big Data challenge is becoming one of the most exciting opportunities of the coming years.
We present here a broad overview of the topic and of the current status of Big Data mining, and show the challenges of, and the tools for, managing the heterogeneous-information frontier in Big Data mining research.
WHAT IS BIG DATA?
Big Data is similar to small data, but bigger in size.
Having bigger data requires different approaches: new techniques, tools, and architectures, with the aim of solving new problems, or old problems in a better way.
Big Data generates value from the storage and processing of very large quantities of digital information that cannot be analyzed with traditional computing techniques.
WHAT IS BIG DATA?
Walmart handles more than 1 million customer transactions every hour.
Facebook handles 40 billion photos from its user base.

THREE CHARACTERISTICS OF BIG DATA: THE 3VS
Volume: data quantity
Velocity: data speed
Variety: data types
1ST CHARACTERISTIC OF BIG DATA: VOLUME
A typical PC might have had 10 gigabytes of storage in 2000.
Today, Facebook ingests 500 terabytes of new data every day.
Smartphones, with the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.
DATA VELOCITY (SPEED)
High-frequency stock trading algorithms reflect market changes within microseconds.
Machine-to-machine processes exchange data between billions of devices.
Infrastructure and sensors generate massive log data in real time.
Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
VARIETY (DATA TYPES: IMAGES, VIDEO, SOUND)
Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.
Traditional database systems were designed to address smaller volumes of structured data, fewer updates, and a predictable, consistent data structure.
Big Data analysis includes all of these different types of data.
PROCESSING BIG DATA
Integrating disparate data stores
Mapping data to the programming framework
Connecting to and extracting data from storage
Transforming data for processing
Subdividing data in preparation for Hadoop MapReduce
Employing Hadoop MapReduce
Creating the components of Hadoop MapReduce jobs (a minimal sketch follows after this list)
Distributing data processing across server farms
Executing Hadoop MapReduce jobs
Monitoring the progress of job flows
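Below is a minimal sketch of the two components of a Hadoop MapReduce job, written as Hadoop Streaming scripts (Hadoop Streaming lets any executable that reads stdin and writes stdout act as mapper or reducer). The word-count task and file names are illustrative only.

    #!/usr/bin/env python
    # mapper.py -- emit a (word, 1) pair for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    #!/usr/bin/env python
    # reducer.py -- sum the counts for each word. Hadoop sorts the mapper
    # output by key, so identical words arrive on consecutive lines.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

The two scripts are submitted together via the hadoop-streaming jar with the -mapper and -reducer options; the exact jar path and flags depend on the Hadoop installation.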


THE STRUCTURE OF BIG DATA
Structured: most traditional data sources
Semi-structured: many sources of big data
Unstructured: video data, audio data
WHY BIG DATA
Growth of Big Data is driven by:
Increase of storage capacities
Increase of processing power
Availability of data (different data types)
Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone.
WHY BIG DATA
Facebook generates 10 TB of data daily.
Twitter generates 7 TB of data daily.
IBM claims 90% of today's stored data was generated in just the last two years.
HOW IS BIG DATA DIFFERENT?
1) Automatically generated by a machine (e.g. a sensor embedded in an engine)
2) Typically an entirely new source of data (e.g. use of the internet)
3) Not designed to be friendly (e.g. text streams)
4) May not have much value: we need to focus on the important parts
DATA GENERATION POINTS: EXAMPLES
Mobile devices
Microphones
Readers/scanners
Science facilities
Programs/software
Social media
Cameras
BIG DATA ANALYTICS
Examining large amounts of data
Extracting appropriate information
Identification of hidden patterns and unknown correlations
Competitive advantage
Better business decisions: strategic and operational
Effective marketing, customer satisfaction, increased revenue
POTENTIAL VALUE OF BIG DATA
$300 billion potential annual value to US health care.
$600 billion potential annual consumer surplus from using personal location data.
60% potential increase in retailers' operating margins.
BIG DATA IN INDIA
Gaining traction.
Huge market opportunities for IT services (82.9% of revenues) and analytics firms (17.1%).
Current market size is $200 million; expected to reach $1 billion by 2015.
The opportunity for Indian service providers lies in offering services around Big Data implementation and analytics for global multinationals.
BENEFITS OF BIG DATA
Real-time big data isn't just a process for storing data in a data warehouse; it's about the ability to make better decisions and take meaningful actions at the right time.
Fast-forward to the present, and technologies like Hadoop give you the scale and flexibility to store data before you know how you are going to process it.
Technologies such as MapReduce, Hive, and Impala enable you to run queries without changing the data structures underneath.
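As a minimal sketch of such a query from Python: the PyHive client library below connects to a HiveServer2 instance; the host, port, and the transactions table are assumptions made for illustration, not part of the original text.

    # Querying Hive from Python via the PyHive client. Hive compiles the
    # SQL into jobs over files in HDFS, so the underlying data structures
    # stay untouched.
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)  # assumed HiveServer2
    cursor = conn.cursor()
    cursor.execute(
        "SELECT product, COUNT(*) AS sales "
        "FROM transactions "                 # hypothetical table
        "GROUP BY product ORDER BY sales DESC LIMIT 10"
    )
    for product, sales in cursor.fetchall():
        print(product, sales)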
BENEFITS OF BIG DATA
Our newest research finds that organizations are using big data to target customer-centric outcomes, tap into internal data, and build a better information ecosystem.
Big Data is already an important part of the $64 billion database and data analytics market.
It offers commercial opportunities of a comparable scale to enterprise software in the late 1980s, the Internet boom of the 1990s, and the social media explosion of today.
WHAT IS BIG DATA?
"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (Gartner, 2012)
Complicated (intelligent) analysis of data may make small data appear to be big.
Bottom line: any data that exceeds our current capability of processing can be regarded as big.
WHAT IS DATA MINING?
Discovery of useful, possibly unexpected, patterns in data
Extraction of implicit, previously unknown, and potentially useful information from data
Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
DATA MINING TASKS
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
CLASSIFICATION: DEFINITION
Given a collection of records (the training set), where each record contains a set of attributes, one of which is the class:
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
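A minimal sketch of this workflow, using scikit-learn's bundled iris dataset and a decision tree (both are illustrative choices, not the only options):

    # Split labelled records into training and test sets, fit a model on
    # the training set, and validate it on the held-out test set.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)            # attributes and class labels
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)    # 70% training, 30% test

    model = DecisionTreeClassifier().fit(X_train, y_train)
    predictions = model.predict(X_test)          # classify unseen records
    print("test accuracy:", accuracy_score(y_test, predictions))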
CLUSTERING
[Figure: customers plotted along income, education, and age axes, grouped into clusters]
K-MEANS CLUSTERING

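A minimal k-means sketch with scikit-learn; the three features echo the axes of the clustering figure above (income, education, age), and the customer data are synthetic, purely for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # 300 synthetic customers: income, years of education, age.
    customers = np.column_stack([
        rng.normal(50_000, 15_000, 300),
        rng.normal(14, 3, 300),
        rng.normal(40, 12, 300),
    ])

    # k-means alternates between assigning points to the nearest centroid
    # and recomputing centroids until the assignment stabilizes.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print("cluster sizes:", np.bincount(kmeans.labels_))
    print("centroids:", kmeans.cluster_centers_)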
ASSOCIATION RULE MINING
Sales records: market-basket data (transaction id, customer id, products bought)
Trend: products p5, p8 often bought together
Trend: customer 12 likes product p9
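A minimal sketch of how such trends are found: count how often each pair of products co-occurs across baskets and keep the pairs above a support threshold. The baskets below are made up for illustration; real systems use algorithms such as Apriori or FP-growth to avoid enumerating every pair.

    from itertools import combinations
    from collections import Counter

    baskets = [          # hypothetical market-basket data
        {"p5", "p8", "p1"},
        {"p5", "p8"},
        {"p2", "p9"},
        {"p5", "p8", "p9"},
    ]

    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    min_support = 3      # a pair must appear in at least 3 baskets
    for pair, count in pair_counts.items():
        if count >= min_support:
            print(pair, "support:", count / len(baskets))
    # prints: ('p5', 'p8') support: 0.75 -- often bought together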
BIG VELOCITY
Sensor tagging of everything of value sends velocity through the roof (e.g. car insurance).
Smartphones as a mobile platform send velocity through the roof.
The state of multi-player internet games must be recorded, which sends velocity through the roof.
BIG DATA STANDARDIZATION CHALLENGES (1)
Big Data use cases, definitions, vocabulary and reference architectures (e.g. system, data, platforms, online/offline)
Specifications and standardization of metadata, including data provenance
Application models (e.g. batch, streaming)
Query languages, including non-relational queries to support diverse data types (XML, RDF, JSON, multimedia) and Big Data operations (e.g. matrix operations)
Domain-specific languages
Semantics of eventual consistency
Advanced network protocols for efficient data transfer
General and domain-specific ontologies and taxonomies for describing data semantics, including interoperation between ontologies
Source: ISO
BIG DATA STANDARDIZATION CHALLENGES (2)
Big Data security and privacy access controls
Remote, distributed, and federated analytics (taking the analytics to the data), including data and processing resource discovery and data mining
Data sharing and exchange
Data storage, e.g. memory storage systems, distributed file systems, data warehouses, etc.
Human consumption of the results of big data analysis (e.g. visualization)
Interfaces between relational (SQL) and non-relational (NoSQL) data stores
Big Data quality and veracity description and management
Source: ISO
TOOLS FOR MANAGING BIG DATA
Hadoop is an open-source framework from Apache that allows us to store and process big data in a distributed environment, across clusters of computers, using simple programming models.
Hadoop is a large-scale distributed batch processing infrastructure. While it can be used on a single machine, its true power lies in its ability to scale to hundreds or thousands of computers, each with several processor cores. Hadoop is also designed to efficiently distribute large amounts of work across a set of machines.
Challenges at Large Scale
Performing large-scale computation is difficult. Working with this volume of data requires distributing parts of the problem to multiple machines to handle in parallel. Whenever multiple machines are used in cooperation with one another, the probability of failures rises. In a single-machine environment, failure is not something that program designers explicitly worry about very often: if the machine has crashed, there is no way for the program to recover anyway.
R
The R programming language is the preferred choice amongst data analysts and data scientists.
There is no doubt that R is the most preferred programming tool for statisticians, data scientists, data analysts, and data architects, but it falls short when working with large datasets.
One major drawback of R is that all objects are loaded into the main memory of a single machine. Petabyte-scale datasets cannot be loaded into RAM; this is where Hadoop, integrated with the R language, is an ideal solution.
To adapt to the in-memory, single-machine limitation of R, data scientists have to limit their data analysis to a sample of the large data set (one standard sampling technique is sketched below).
R and Hadoop were not natural friends, but with the advent of novel packages like RHadoop, RHIVE, and RHIPE, the two seemingly different technologies complement each other for big data analytics and visualization.
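One standard way to take such a sample is reservoir sampling, which keeps a fixed-size uniform random sample while streaming once through a dataset of any size. A minimal Python sketch (the file name is hypothetical):

    import random

    def reservoir_sample(lines, k):
        # Keep a uniform random sample of k items from a stream of
        # unknown, possibly enormous, length.
        sample = []
        for i, line in enumerate(lines):
            if i < k:
                sample.append(line)
            else:
                # Keep the new line with probability k / (i + 1).
                j = random.randrange(i + 1)
                if j < k:
                    sample[j] = line
        return sample

    with open("huge_dataset.csv") as f:         # far larger than RAM
        subset = reservoir_sample(f, k=10_000)  # small enough to analyze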
STORM
Storm is a distributed real-time computation system for processing large volumes of high-velocity data.
Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size. Enterprises combine it with other data access applications in Hadoop to prevent undesirable events or to optimize positive outcomes.
Some specific new business opportunities include real-time customer service management, data monetization, operational dashboards, and cyber security analytics and threat detection.
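Storm topologies themselves are built from spouts and bolts, typically in Java or Clojure; purely as an illustration of the kind of per-event, real-time computation involved, here is a plain-Python rolling count of events per key over the last 60 seconds of a stream:

    import time
    from collections import deque, Counter

    WINDOW = 60.0                # seconds of history to keep
    events = deque()             # (timestamp, key) pairs inside the window
    counts = Counter()

    def observe(key, now=None):
        now = time.time() if now is None else now
        events.append((now, key))
        counts[key] += 1
        # Evict events that have slid out of the window.
        while events and events[0][0] < now - WINDOW:
            _, old_key = events.popleft()
            counts[old_key] -= 1

    observe("login_failure")     # hypothetical security event
    print(counts.most_common(3)) # feed a dashboard or an alert rule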
APACHE MAHOUT
Apache Mahout is a powerful, scalable machine-learning library that runs on top of Hadoop MapReduce.
We are living in a day and age where information is available in abundance. The information overload has scaled to such heights that sometimes it becomes difficult to manage our little mailboxes! Imagine the volume of data and records some of the popular websites (the likes of Facebook, Twitter, and YouTube) have to collect and manage on a daily basis. It is not uncommon even for lesser-known websites to receive huge amounts of information in bulk.
Normally we fall back on data mining algorithms to analyze bulk data, identify trends, and draw conclusions. However, no data mining algorithm can be efficient enough to process very large datasets and provide outcomes quickly unless the computational tasks are run on multiple machines distributed over the cloud.
We now have new frameworks that allow us to break a computation task into multiple segments and run each segment on a different machine. Mahout is such a data mining framework, normally coupled with Hadoop infrastructure in the background to manage huge volumes of data.
Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as:
Recommendation
Classification
Clustering
Apache Mahout started as a sub-project of Apache's Lucene in 2008. In 2010, Mahout became a top-level project of Apache.
APACHE S4
S4 is a general-purpose, distributed, scalable,
fault-tolerant, pluggable platform that allows
programmers to easily develop applications for
processing continuous unbounded streams of
data.
BIG DATA MINING TOOLS
The Big Data phenomenon is intrinsically related to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter, and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure deals with Hadoop and other related software, such as:
Apache Hadoop: software for data-intensive distributed applications, based on the MapReduce programming model and a distributed file system called the Hadoop Distributed Filesystem (HDFS). Hadoop allows writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes.
A MapReduce job divides the input dataset into independent subsets that are processed by map tasks in parallel. This mapping step is then followed by a step of reduce tasks, which use the output of the maps to obtain the final result of the job.
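That flow can be simulated in miniature in ordinary Python: map over independent input splits, group the intermediate pairs by key (Hadoop's shuffle), then reduce each group. The word-count example below is illustrative only.

    from itertools import groupby
    from operator import itemgetter

    splits = [["big data", "data mining"], ["big data mining"]]

    # Map phase: each split is processed independently.
    pairs = [(w, 1) for split in splits for line in split for w in line.split()]

    # Shuffle: bring all pairs with the same key together.
    pairs.sort(key=itemgetter(0))

    # Reduce phase: one reduction per key.
    result = {key: sum(v for _, v in group)
              for key, group in groupby(pairs, key=itemgetter(0))}
    print(result)   # {'big': 2, 'data': 3, 'mining': 2}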
Apache S4: a platform for processing continuous data streams. S4 is designed specifically for managing data streams; S4 apps are built by combining streams and processing elements in real time.
Storm: software for streaming data-intensive distributed applications, similar to S4, developed by Nathan Marz at Twitter.
In Big Data mining, there are many open source initiatives. The most popular are the following:
Apache Mahout: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering, and frequent pattern mining.
R: an open source programming language and software environment designed for statistical computing and visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, beginning in 1993, and is used for statistical analysis of very large data sets.

MOA: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining, and frequent graph mining. It started as a project of the Machine Learning group at the University of Waikato, New Zealand, famous for the WEKA software. The related streams framework provides an environment for defining and running stream processes using simple XML-based definitions, and is able to use MOA, Android, and Storm. SAMOA is an upcoming software project for distributed stream mining that will combine S4 and Storm with MOA.
Vowpal Wabbit: an open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. It can exceed the throughput of any single machine's network interface when doing linear learning, via parallel learning.
MORE SPECIFIC TO BIG GRAPH MINING, WE FOUND THE FOLLOWING OPEN SOURCE TOOLS:
Pegasus: a big graph mining system built on top of MapReduce. It allows finding patterns and anomalies in massive real-world graphs (a toy sketch of this style of computation follows below).
GraphLab: a high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records, which are stored as vertices in a large distributed data graph.
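Neither Pegasus nor GraphLab exposes a Python interface like the one below; purely as a toy, single-machine sketch of the vertex-centric computation such systems distribute across clusters, here is PageRank on a three-node graph:

    # Each vertex repeatedly pushes a share of its rank to its neighbours;
    # graph-parallel systems run this per-vertex step concurrently over
    # billions of vertices.
    out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    rank = {v: 1.0 / len(out_links) for v in out_links}
    damping = 0.85

    for _ in range(50):          # iterate until the ranks stabilize
        new_rank = {v: (1 - damping) / len(out_links) for v in out_links}
        for v, targets in out_links.items():
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank

    print({v: round(r, 3) for v, r in rank.items()})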
REFERENCES
[1] Apache Hadoop, http://hadoop.apache.org.
[2] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, and G. Lapis. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGraw-Hill, 2011.
[3] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. In ICDM Workshops, pages 170-177, 2010.
[4] Storm, http://storm-project.net.
[5] Apache Mahout, http://mahout.apache.org.
[6] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.
[7] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis, http://moa.cms.waikato.ac.nz/. Journal of Machine Learning Research (JMLR), 2010.
[8] D. Laney. 3-D Data Management: Controlling Data Volume, Velocity and Variety. META Group Research Note, February 6, 2001.
[9] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.
[10] J. Gantz and D. Reinsel. IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. December 2012.
THANK YOU.