
Big Data Analytics

Dr G Sudha Sadasivam, Professor, CSE, PSGCT


"Data is a new class of economic asset, like currency and gold." (Source: World Economic Forum, 2012)

Data is the new raw material

Agenda
Big Data
NoSQL databases
Hadoop
MR examples
Case studies: genome analysis, prediction of stock prices

1. SCIENCE PARADIGM SHIFT

Empirical: observation, description and experimentation
Theoretical: theory development
Computational: simulation and models
Data-intensive: volume, variety and real-time data

Trend: Big Data becoming commonplace

Massive volumes of data:
10 Terabytes per day
Transactions: 46 Terabytes per year
Genomes: Petabytes per year
7 Terabytes per day
20 Petabytes per day
Call records: 5 Terabytes per day
New video uploads: 4.5 Terabytes per day
100 Terabytes per year
LHC: 40 Terabytes per second

Challenge
Big Data's characteristics are challenging conventional information management architectures: massive and growing amounts of information residing internally and externally to the organization; unconventional, semi-structured or unstructured (diverse) data, including web pages, log files, social media, click-streams, instant messages, text messages, emails, and sensor data from active and passive systems; and constantly changing information.

Example applications: sentiment analytics, multi-channel analytics, transaction analytics, claim fraud analytics, warranty claim analytics, surveillance analytics, CDR analytics.

What is big data?


A massive volume of both structured and unstructured data that is so large that it is difficult to store, analyse, process, share, visualise and manage with traditional database and software techniques. The term was introduced by Roger Magoulas of O'Reilly in 2005.

History

1) 1970s: early references in atmospheric and oceanic environmental science. 2) Since 2008: a growing number of references. 3) Spread across different disciplines, including earth, health, engineering, arts and humanities, and environmental sciences. 4) Computer science.

What Makes it Big Data? (The 4 V's)


Data sources: social media, blogs, smart meters, machine-generated data

The four V's: VOLUME, VELOCITY, VARIETY, VALUE

Volume: Gigabyte (10^9), Terabyte (10^12), Petabyte (10^15), Exabyte (10^18), Zettabyte (10^21)
Variety: structured, semi-structured, unstructured; text, image, audio, video, records
Velocity: dynamic, sometimes time-varying data

Where Do We See Big Data?

Everywhere: data warehouses, OLTP systems, social networks, scientific devices.

Big Data vis-à-vis Existing Communities

Variety: machine learning, NLP
Volume: databases
Velocity: complex event processing
Big Data sits at the intersection of all three.

Big data use cases


Sector | Today's Challenge | New Data | What's Possible
Healthcare | Expensive office visits | Remote patient monitoring | Preventive care, reduced hospitalization
Manufacturing | In-person support | Product sensors | Automated diagnosis, support
Location-Based Services | Based on home zip code | Real-time location data | Geo-advertising, traffic, local search
Public Sector | Standardized services | Citizen surveys | Tailored services, cost reductions
Retail | One-size-fits-all marketing | Social media | Sentiment analysis, segmentation

Why Is Big Data Important?


US Health Care: increase industry value per year by $300 B
Manufacturing: decrease development and assembly costs by 50%
Global Personal Location Data: increase service provider revenue by $100 B
Europe Public Sector Admin: increase industry value per year by 250 B
US Retail: increase net margin by 60+%

2. RDBMS Performance

PROBLEMS: RIGID SCHEMA, SCALABILITY, PERFORMANCE, ACID properties

Data Trends
Trend 1: Size
Trend 2: Connectedness
Trend 3: Semi-structure
Trend 4: Architecture

Web / New generation data


1. Tables with lots of columns, each of which is only used by a few rows. 2. Attribute tables. 3. Connected data (many-to-many relationships, tree-like characteristics). 4. Frequent schema changes.

Next Generation Databases (web scale): NoSQL (2009)

The name NoSQL was first used in the 1990s by Carlo Strozzi for a non-relational, flat-file database; today it is read as "Not only SQL".
Problems with relational databases: scalability, rigid schema, performance.
NoSQL characteristics: distributed, open-source, data partitioning; horizontally scalable (new columns / nodes can be added); vertically scalable (more information can be added); schema-free; replication support; easy API; relaxed consistency; with or without ACID properties; no predefined schema; no joins.

NoSQL properties (CAP)
CONSISTENCY: all database clients see the same data, even with concurrent updates.
AVAILABILITY: all database clients are always able to access some version of the data.
PARTITION TOLERANCE: the database can be split over multiple servers.
(Contrast with the ACID properties of SQL databases.)

NoSQL Data models

Advantages of NoSQL
Cheap, easy to implement
Data are replicated and can be partitioned
Easy to distribute
Don't require a schema
Can scale up and down
Quickly process large amounts of data
Relax the data consistency requirement (CAP)

Disadvantages
New and sometimes buggy
Data is generally duplicated, with potential for inconsistency
No standardized schema
No standard format for queries
No standard language
Most NoSQL systems avoid in-memory storage
No guarantee of support

3. Hadoop myth vs reality


Hadoop is not a direct replacement for enterprise data stores that are used to manage structured or transactional data.

It augments enterprise data architectures by providing an efficient way of storing, processing, managing and analyzing large volumes of semi-structured or unstructured data.

Hadoop is useful across virtually every vertical industry.

Hadoop Some Use Cases


Digital marketing automation
Log analysis and event correlation
Fraud detection and prevention
Predictive modeling for new drugs
Social network and relationship analysis
ETL (Extract, Transform, Load) on unstructured data
Image correlation and analysis
Collaborative filtering


Hadoop What do we expect from it ?


If we analyze the above use cases, we realize that we need support for:
Heterogeneous data from various sources.

Real time processing of incoming data.

Need for a connector to existing RDBMSs

Need for a Distributed File System

Need for data warehouse

Need for a Scalable database

GUI to operate and Develop Applications for Hadoop

Need for a Framework for Parallel Compute

Need for machine learning and data mining libraries

Hadoop Components
HDFS: distributed file system
MapReduce: distributed processing of large data sets
HBase: scalable distributed database, supports structured data
Hive: data warehousing framework
Pig: framework for parallel computation
Mahout: scalable machine learning and data mining library
ZooKeeper: coordination service for distributed applications
Oozie: workflow service to manage data processing jobs
Chukwa: monitoring of large distributed systems
Hue: GUI to operate and develop Hadoop applications
Flume: moves large volumes of data efficiently
Avro: data serialization system
Sqoop: connector to structured databases
and many more.

Hadoop: Who's Using It?

Uses Hadoop and HBase for: social services, structured data storage, processing for internal use.
Uses Hadoop for: Amazon's product search indices; millions of sessions processed daily for analytics.
Uses Hadoop for: search optimization and research.
Uses Hadoop for: internal log reporting/parsing systems designed to "scale to infinity and beyond"; a web-wide analytics platform.
Uses Hadoop as: a source for reporting/analytics and machine learning.
Uses Hadoop for: databasing and analyzing Next Generation Sequencing (NGS) data produced for the Cancer Genome Atlas (TCGA) project and other groups.

And many more.

Hadoop The Various Forms Today


Apache Hadoop: native Hadoop distribution from the Apache Foundation
Yahoo! Hadoop: Hadoop distribution of Yahoo
CDH: Hadoop distribution from Cloudera
Greenplum Hadoop: Hadoop distribution from EMC
HDP: Hadoop platform from Hortonworks
M3 / M5 / M7: Hadoop distributions from MapR
Project Serengeti: VMware's implementation of Hadoop on vCenter
And more

Hadoop and MR Programming

A framework for running applications on large clusters of commodity hardware (around 1000 nodes) to store and process huge volumes of data (petabytes to zettabytes). Open-source Apache Software Foundation project.

Hadoop includes:
HDFS: a distributed filesystem to distribute the data
Map/Reduce: a data-parallel programming model, implemented on top of HDFS

CONCEPT: moving computation is more efficient than moving large data.

Each cluster node runs both DFS and MR.

Hadoop Cluster Architecture:

HDFS Architecture

NameNode: maps (filename, offset) -> block id, and block -> DataNode
DataNode: maps block -> local disk
Secondary NameNode: periodically merges edit logs
A block is also called a chunk.

Dataflow in Hadoop

1. The client submits the job; the scheduler assigns map and reduce tasks to cluster nodes.
2. Each map task reads an input file block from HDFS.
3. A finished map writes its output to the local file system and reports its completion and output location.
4. Reduce tasks fetch the map outputs from the mappers' local file systems via HTTP GET.
5. The reduce tasks write the final answer to HDFS.

4. MR Examples

CALCULATING PI

The area of the square: As = (2r)^2 = 4r^2. The area of the circle: Ac = pi * r^2.
Therefore pi = 4 * Ac / As, which can be estimated by generating random points in the square and counting how many also fall inside the circle:
pi ≈ 4 * (number of points inside the circle) / (number of points generated in the square)
MAP: generate random points in the square and count those that also fall inside the circle.
REDUCE: sum the counts from all mappers and compute the estimate of pi.
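A minimal sketch of this Monte Carlo estimate in map/reduce style (plain Python, not tied to a specific Hadoop API; the function names are illustrative):

```python
import random

def map_task(num_points):
    """Map: generate random points in the unit square and count those
    that also fall inside the quarter circle of radius 1."""
    inside = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, num_points

def reduce_task(partial_counts):
    """Reduce: sum the per-mapper counts and estimate pi = 4 * inside / total."""
    inside = sum(c[0] for c in partial_counts)
    total = sum(c[1] for c in partial_counts)
    return 4.0 * inside / total

# Example: 4 "mappers", 100000 points each
print(reduce_task([map_task(100_000) for _ in range(4)]))
```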

Restricted parallel programming model meant for large clusters


User implements Map() and Reduce()

Ex: WORD COUNT EXAMPLE


Divide the document so that each mapper analyses one line; the reducer sums up all the counts.

Each mapper counts words in 1 line

File:
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Map: for the given sample input, the first map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
The second map emits:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>

The output of the first combine:
< Bye, 1> < Hello, 1> < World, 2>
The output of the second combine:
< Goodbye, 1> < Hadoop, 2> < Hello, 1>
Thus the output of the job (reduce) is:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>

Map()
Input <filename, file text> Parses file and emits <word, count> pairs
eg. <Hello, 1>

Reduce()
Sums all values for the same key and emits <word, TotalCount>
eg. <Hello, (1, 1)> => <Hello, 2>
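A hedged sketch of the word count mapper and reducer in the Hadoop Streaming style (stdin/stdout, tab-separated key and value); in practice these would be two separate scripts passed to the streaming jar, and this is an illustration rather than the Java code usually shown for Hadoop:

```python
#!/usr/bin/env python3
import sys

def mapper():
    """mapper.py: emit <word, 1> for every word on every input line."""
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    """reducer.py: sum the counts per word (input arrives sorted by key)."""
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")
```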

Parallelisation - Hadoop MapReduce

The two lines of the file are processed by two mappers in parallel; their outputs are combined and then merged by the reducers into the final counts shown above.

I work in a company where the Engineering Department produces an amazing amount of CAD (Computer-Aided Design) files. Over the years we ended up having hundreds of thousands, if not millions, of files hosted on different filer systems. Quite often the engineers need to access those files to modify, evolve or consult the information inside. The problem is that even though the engineers know precisely the name of the file they want, it takes quite a while (sometimes more than an hour) for the filer system to actually find it and send it back to the engineer's PC, because no indexing system exists on the filer hosting system (the system tests every single inode until the correct one is found). The files are not very big (a couple of dozen MB), but there are so many of them. Is Hadoop suitable?

5. Genome clustering
A genome is a DNA sequence over the alphabet {A, T, G, C}. Features (motifs, here 3-mers) that characterize a genome are extracted in MR job 1.
Example sequence: AAAGGGTTTCCCAAAG
Mapper: AAA-1, AAG-1, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1, CCC-1, CCA-1, CAA-1, AAA-1, AAG-1
Reducer: AAA-2, AAG-2, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1, CCC-1, CCA-1, CAA-1
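A small sketch of the motif extraction as map and reduce functions (plain Python; the 3-mer window length is taken from the example above, and the function names are illustrative):

```python
from collections import Counter

K = 3  # motif length used in the example

def map_motifs(sequence):
    """Map: emit (k-mer, 1) for every overlapping window of the sequence."""
    for i in range(len(sequence) - K + 1):
        yield sequence[i:i + K], 1

def reduce_motifs(pairs):
    """Reduce: sum the counts per k-mer, giving the feature descriptor."""
    counts = Counter()
    for kmer, c in pairs:
        counts[kmer] += c
    return counts

print(reduce_motifs(map_motifs("AAAGGGTTTCCCAAAG")))
# AAA:2, AAG:2, AGG:1, GGG:1, GGT:1, GTT:1, TTT:1, TTC:1, TCC:1, CCC:1, CCA:1, CAA:1
```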

Pipeline:
1. Read the input sequence
2. Obtain the [1x64] feature descriptor vector (MR job 1)
3. Compare to the existing species by clustering (MR job 2)
4. Identify the species

K-Means
1. Fix the initial cluster centres
2. Find the distance between samples and cluster centres
3. Find which cluster centre is closest to a given sample
4. Add the sample to that cluster
5. Re-evaluate the cluster centres

MapReduce k-means
Map input: key = cluster id, value = (motif, count) feature vector. The map calculates the distance of each point from the cluster centres. Map output: key = cluster id, value = distance to each cluster centre.
Reducer: finds the closest cluster centre for each point and recalculates the new centres for the next iteration. Output: new centres.

MR k-means
Let k be the number of features in a sample, V1(k) the value of the kth feature of the sample, and V2(k) the value of the kth feature of a cluster centre.
Mapper: the distance to the ith cluster centre is the Euclidean distance
Si = sqrt( Σk (V1(k) - V2(k))^2 )
Reducer: min[Si] over all cluster centres i is evaluated for a species; the centroid is then re-evaluated over all features (k) of the samples assigned to each cluster, to update the centroid.
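A minimal sketch of one MapReduce-style k-means iteration over such feature vectors (plain Python with NumPy; the Euclidean distance above is assumed, and the data here is illustrative):

```python
import numpy as np

def kmeans_map(samples, centres):
    """Map: assign each sample to its closest cluster centre."""
    for v1 in samples:
        dists = [np.sqrt(((v1 - v2) ** 2).sum()) for v2 in centres]
        yield int(np.argmin(dists)), v1          # (cluster id, sample)

def kmeans_reduce(assignments, old_centres):
    """Reduce: recompute each centroid as the mean of its assigned samples."""
    new_centres = []
    for cid, old in enumerate(old_centres):
        members = [v for c, v in assignments if c == cid]
        new_centres.append(np.mean(members, axis=0) if members else old)
    return new_centres

# One iteration over 64-dimensional motif-count vectors (toy data):
samples = [np.random.rand(64) for _ in range(10)]
centres = samples[:3]
centres = kmeans_reduce(list(kmeans_map(samples, centres)), centres)
```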

Accuracy comparison: accuracy (percentage) vs. length of input sequence (chart).

Time efficiency: execution time vs. length of input sequence (chart).

HADOOP ON AWS
Cluster layout: one master instance, core instances and task instances.

PERFORMANCE

File size | Uploading time | Computation time | Overall time
15.9 KB | 1 sec | 4 min | 6 min
1 MB | 3 sec | 4 min | 6 min
54.4 MB | 5 min | 6 min | 9 min
118.9 MB | 13 min | 8 min | 13 min

(Chart: MR and total time for the instance-type configurations sss, ssm, ssl, mms.)

Number of instances | Type of instances | Computation time | Overall time
M-1, C-2, T-1 | M-s, C-s, T-s | 8 min | 13 min
M-1, C-2, T-1 | M-m, C-m, T-m | 5 min | 9 min
M-1, C-2, T-1 | M-s, C-s, T-m | 6 min | 11 min
M-1, C-2, T-1 | M-s, C-s, T-l | 7 min | 11 min
(M = master, C = core, T = task; s / m / l = small / medium / large instance types)

Local vs cloud cluster

(Chart: execution time of the local Hadoop cluster vs. AWS for input sizes of 1 KB, 15 KB, 15 MB and 119 MB.)

6. Case Study: Prediction of stock prices


A Support Vector Machine (SVM) is a supervised learning method that analyzes data used for classification and regression analysis

Learning in SVM is done by finding a hyperplane


Separates the training set in a feature space using a kernel function, which is an inner product of input-space features

Binary classification can be viewed as the task of separating classes in feature space:
f(x) = sign(w^T x + b)
Hyperplane: w^T x + b = 0; positive class: w^T x + b > 0; negative class: w^T x + b < 0
where w is the normal vector to the hyperplane, x represents the training patterns, b is the threshold, αi are the coefficients and si are the support vectors.

w = Σi αi si

Methodology

Steps
1. Data collection: download data sets for a month / year / day
2. Input feature selection: MA and RSI; the volume average determines the + / - class of each point
3. Finding the support vectors S1, S2 and the coefficients α1, α2
4. Construction of the hyperplane y = w^T x + b
5. Classification of the test data into positive / negative points using the hyperplane and the X values

2. a) Input feature
Moving Average
MA = (Σ i=1..n CPi) / n, where CPi = closing price on the ith day

Relative Strength Index
RSI = 100 * RS / (1 + RS), where RS = AU / AD
AU = total upward price change during the past n days
AD = total downward price change during the past n days
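A small sketch of the two input features, following the formulas above (plain Python; `closes` and the open-minus-close differences are assumed to be lists of daily values):

```python
def moving_average(closes):
    """MA = sum of closing prices / n."""
    return sum(closes) / len(closes)

def rsi(price_changes):
    """RSI = 100 * RS / (1 + RS), with RS = AU / AD (total up / total down change)."""
    au = sum(d for d in price_changes if d > 0)
    ad = sum(-d for d in price_changes if d < 0)
    rs = au / ad
    return 100 * rs / (1 + rs)

# Values taken from the worked example that follows:
print(moving_average([7.53, 8.08, 8.42, 9.15]))   # ~8.30
print(rsi([0.66, 0.73, 0.78, -0.55]))             # ~79.8
```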

Input Data & Feature Selection Example

Date | High | Low | Open | Close | Volume | Open - Close
26.8.11 | 8.87 | 7.04 | 8.19 | 7.53 | 4.3 | 0.66
19.8.11 | 9.48 | 7.70 | 8.81 | 8.08 | 2.5 | 0.73
12.8.11 | 9.27 | 8.32 | 9.20 | 8.42 | 1.6 | 0.78
5.8.11 | 10.83 | 7.74 | 8.60 | 9.15 | 2.8 | -0.55

RS = (0.66 + 0.73 + 0.78) / 0.55 = 3.94
Y = RSI = 3.94 / (1 + 3.94) * 100 = 79.87
X = MA = 8.3; volume average = 2.8
Total data set volume average = 1.54, so this is a + point (81.61% increase).

2. b) Repeat for the other months

Month | MA | RSI | +/- (volume average)
August | 8.30 | 79.87 | +
June | 8.97 | 85.81 | +
May | 10.73 | 74.09 | +
April | 14.54 | 55.35 | +
March | 13.74 | 45.94 | +
November | 11.79 | 24.24 | +
July | 10.61 | 42.53 | -
February | 12.32 | 31.50 | -
January | 10.44 | 83.97 | -
December | 12.78 | 28.05 | -
October | 8.81 | 31.03 | -

3. Support Vectors
Support vectors are selected using the Euclidean distance. In the Euclidean plane, if X = (x1, x2) and Y = (y1, y2), the distance is D(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 ), where X and Y are the positive and negative points in (Moving Average, Relative Strength Index) space.

+ive point (8.30, 79.87):
  distance to (10.61, 42.53) = 37.44
  distance to (12.32, 31.50) = 48.53
  distance to (10.44, 83.97) = 4.62
  distance to (12.78, 28.05) = 52.01
  distance to (8.81, 31.03) = 48.84

+ive point (8.97, 85.81):
  distance to (10.61, 42.53) = 43.31
  distance to (12.32, 31.50) = 54.41
  distance to (10.44, 83.97) = 2.35 (minimum)
  distance to (12.78, 28.05) = 57.88
  distance to (8.81, 31.03) = 54.77

The distances for the other points are also calculated and the minimum is found:
S1 = (8.97, 85.81), S2 = (10.44, 83.97)
Augmented with a bias component: S1 = (8.97, 85.81, 1), S2 = (10.44, 83.97, 1)
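A sketch of this nearest-pair search (Python with NumPy; the points are the (MA, RSI) pairs from the table above):

```python
import numpy as np

pos = np.array([[8.30, 79.87], [8.97, 85.81]])    # positive (MA, RSI) points shown above
neg = np.array([[10.61, 42.53], [12.32, 31.50], [10.44, 83.97],
                [12.78, 28.05], [8.81, 31.03]])   # negative points

# Euclidean distance between every positive and every negative point
dists = np.sqrt(((pos[:, None, :] - neg[None, :, :]) ** 2).sum(axis=2))
i, j = np.unravel_index(np.argmin(dists), dists.shape)
s1, s2 = pos[i], neg[j]            # closest pair -> support vectors
print(s1, s2, dists[i, j])         # ~ [8.97 85.81] [10.44 83.97] 2.35
```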

3. b) Finding α1 and α2
α1 (S1 · S1) + α2 (S1 · S2) = +1
α1 (S1 · S2) + α2 (S2 · S2) = -1
7444.81 α1 + 7300.11 α2 = +1
7300.11 α1 + 7160.95 α2 = -1
α1 = -0.71; α2 = 0.73
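A hedged sketch of this step: augment the support vectors with a bias component of 1, build their Gram matrix and solve the resulting 2x2 linear system (NumPy; the sign of each α depends on which support vector is assigned the +1 label, so only the magnitudes are asserted here):

```python
import numpy as np

s1 = np.array([8.97, 85.81, 1.0])     # support vectors augmented with a bias term
s2 = np.array([10.44, 83.97, 1.0])

S = np.vstack([s1, s2])
gram = S @ S.T                        # ~[[7444.81, 7300.11], [7300.11, 7160.95]]
labels = np.array([+1.0, -1.0])       # class labels assigned to s1 and s2
alphas = np.linalg.solve(gram, labels)
print(alphas)                         # magnitudes ~0.71 and ~0.73
```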

4. Construction of Hyperplane
The hyperplane is constructed using the equation y = w^T x + b, where w = Σi αi si.
w is the normal vector to the hyperplane, x the training patterns, b the threshold, αi the coefficients (calculated in the previous step) and si the support vectors.

w^T = Σi αi si = α1 s1 + α2 s2 = (1.25, 0.37), b = 0.02
Hyperplane: y = (1.25, 0.37) · x + 0.02

5. Testing set
Test point: MA = x1 = 12.44, RSI = x2 = 50; predict the volume average class.
y = (1.25, 0.37) · (12.44, 50) + 0.02 = 34.05 + 0.02 = 34.07 (predicted value, volume average)
Overall average = 18
Increase = (34.07 - 18) / 18 ≈ 89%
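A sketch reproducing the hyperplane and test-point calculation above (NumPy; the α values are taken from the previous step, and small differences from 34.07 come from rounding w to two decimals on the slide):

```python
import numpy as np

s1 = np.array([8.97, 85.81, 1.0])
s2 = np.array([10.44, 83.97, 1.0])
a1, a2 = -0.71, 0.73                  # coefficients from the previous step

w_aug = a1 * s1 + a2 * s2             # ~[1.25, 0.37, 0.02]
w, b = w_aug[:2], w_aug[2]

x_test = np.array([12.44, 50.0])      # MA and RSI of the test month
y = w @ x_test + b                    # ~34 (the slide rounds w and gets 34.07)
print(w, b, y)

overall_avg = 18.0
print((y - overall_avg) / overall_avg * 100)   # ~89-90% increase
```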

Multi SVM
Support vector machine is a powerful tool for binary classification, capable of generating very fast classifier functions following a training period

To extend the binary-class scenario to a multi-class scenario, decompose an M-class problem into a series of two-class problems

Approaches for Multi class SVM


Multiclass ranking SVMs, in which one SVM decision function attempts to classify all classes
One-against-all classification, in which there is one binary SVM for each class to separate members of that class from members of all other classes
Pairwise classification, in which there is one binary SVM for each pair of classes to separate members of one class from members of the other
Binary decision tree (BDT) classification
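For illustration only, scikit-learn (if available) exposes the one-against-all and pairwise strategies on top of its binary SVM; this is a generic sketch with toy data, not the exact setup used in the case study:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(40, 2)                  # toy (MA, RSI)-style features
y = np.random.randint(0, 4, size=40)       # four classes, e.g. the ranges below

ovo = SVC(kernel="linear", decision_function_shape="ovo")   # pairwise (one-vs-one)
ova = OneVsRestClassifier(SVC(kernel="linear"))             # one-against-all
ovo.fit(X, y)
ova.fit(X, y)
print(ovo.predict(X[:5]), ova.predict(X[:5]))
```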

Example classes: -100 to -50, -50 to 0, 0 to 50, 50 to 100

One-against-all / pairwise: binary SVMs are built directly over the four classes (-100 to -50, -50 to 0, 0 to 50, 50 to 100).

Binary decision tree (BDT): the root (-100 to 100) splits into -100 to 0 and 0 to 100, which in turn split into the four leaf classes -100 to -50, -50 to 0, 0 to 50 and 50 to 100.

7. Conclusion - Big Data Is About


Tapping into diverse data sets
Finding and monetizing unknown relationships
Data-driven business decisions

Big Data in Action

The cycle: ACQUIRE -> ORGANIZE -> ANALYZE -> DECIDE
Acquire all available data.
Organize and distill big data using massive parallelism.
Analyze all your data at once.
Decide based on real-time big data.
Make better decisions using big data.

Summary
Big data NoSQL Hadoop Case Study

Questions
