
Big Data Analytics

Dr G Sudha Sadasivam, Professor, CSE, PSGCT


"Data is a new class of economic asset, like currency and gold." (Source: World Economic Forum, 2012)

Data is the new raw material

Agenda
Big Data
NoSQL databases
Hadoop
MR examples
Case studies: genome analysis, prediction of stock prices

1. SCIENCE PARADIGM SHIFT

Empirical: observation, description and experimentation
Theoretical: theory development
Computational: simulation and models
Data-intensive: volume, variety and real-time data

Trend: Big Data becoming commonplace

Massive volumes of data:
10 Terabytes per day
Transactions: 46 Terabytes per year
Genomes: Petabytes per year
7 Terabytes per day
20 Petabytes per day
Call records: 5 Terabytes per day
New video uploads: 4.5 Terabytes per day
100 Terabytes per year
LHC: 40 Terabytes per second

Challenge
Big Data's characteristics are challenging conventional information management architectures: massive and growing amounts of information residing internally and externally to the organization; unconventional, semi-structured or unstructured (diverse) data, including web pages, log files, social media, click-streams, instant messages, text messages, emails, and sensor data from active and passive systems; and constantly changing information.

Example applications: sentiment analytics, multi-channel analytics, transaction analytics, claim fraud analytics, warranty claim analytics, surveillance analytics, CDR analytics.

What is big data?


A massive volume of both structured and unstructured data that is so large that it is difficult to store, analyse, process, share, visualise and manage with traditional database and software techniques. The term was introduced by Roger Magoulas of O'Reilly in 2005.

History

1) 1970s: early references in atmospheric and oceanic environmental science. 2) Since 2008: a growing number of references. 3) Spread across different disciplines, including earth, health, engineering, arts and humanities, and environmental sciences. 4) Computer science.

What Makes it Big Data? (The 4 V's)


Data sources: social media, blogs, smart meters, machine-generated data

The four V's: VOLUME, VELOCITY, VARIETY, VALUE

Volume: Gigabyte (10^9), Terabyte (10^12), Petabyte (10^15), Exabyte (10^18), Zettabyte (10^21)
Variety: structured, semi-structured, unstructured; text, image, audio, video, records
Velocity: dynamic, sometimes time-varying data

Where Do We See Big Data?

Everywhere: data warehouses, OLTP systems, social networks, scientific devices.

Big Data vis-à-vis Existing Communities

Variety: machine learning, NLP
Volume: databases
Velocity: complex event processing
Big Data sits at the intersection of all three.

Big data use cases


Sector | Today's Challenge | New Data | What's Possible
Healthcare | Expensive office visits | Remote patient monitoring | Preventive care, reduced hospitalization
Manufacturing | In-person support | Product sensors | Automated diagnosis, support
Location-Based Services | Based on home zip code | Real-time location data | Geo-advertising, traffic, local search
Public Sector | Standardized services | Citizen surveys | Tailored services, cost reductions
Retail | One-size-fits-all marketing | Social media | Sentiment analysis, segmentation

Why Is Big Data Important?


US Health Care: increase industry value per year by $300 B
Manufacturing: decrease development and assembly costs by 50%
Global Personal Location Data: increase service provider revenue by $100 B
Europe Public Sector Admin: increase industry value per year by 250 B
US Retail: increase net margin by 60+%

2. RDBMS Performance

PROBLEMS: RIGID SCHEMA, SCALABILITY, PERFORMANCE, ACID properties

Data Trends
Trend 1: Size
Trend 2: Connectedness
Trend 3: Semi-structure
Trend 4: Architecture

Web / New generation data


1. Tables with lots of columns, each of which is only used by a few rows. 2. Attribute tables. 3. Connected data (many-to-many relationships, tree-like characteristics). 4. Frequent schema changes.

Next Generation Databases (web scale): NoSQL (2009)

The name NoSQL was first used in the 1990s by Carlo Strozzi for a non-relational, flat-file database; today it is read as "Not only SQL".
Problems with relational databases: scalability, rigid schema, performance.
NoSQL characteristics: distributed, open-source, data partitioning; horizontally scalable (new columns / nodes can be added); vertically scalable (more information can be added); schema-free; replication support; easy API; relaxed consistency; with or without ACID properties; no predefined schema; no joins.

NoSQL properties (CAP)
CONSISTENCY: all database clients see the same data, even with concurrent updates.
AVAILABILITY: all database clients are always able to access some version of the data.
PARTITION TOLERANCE: the database can be split over multiple servers.
(Contrast with the ACID properties of SQL databases.)

NoSQL Data models

Advantages of NoSQL
Cheap, easy to implement
Data are replicated and can be partitioned
Easy to distribute
Don't require a schema
Can scale up and down
Quickly process large amounts of data
Relax the data consistency requirement (CAP)

Disadvantages
New and sometimes buggy
Data is generally duplicated, with potential for inconsistency
No standardized schema
No standard format for queries
No standard language
Most NoSQL systems avoid in-memory storage
No guarantee of support

3. Hadoop myth vs reality


Hadoop is not a direct replacement for enterprise data stores that are used to manage structured or transactional data.

It augments enterprise data architectures by providing an efficient way of storing, processing, managing and analyzing large volumes of semi-structured or unstructured data.

Hadoop is useful across virtually every vertical industry.

Hadoop Some Use Cases


Digital marketing automation
Log analysis and event correlation
Fraud detection and prevention
Predictive modeling for new drugs
Social network and relationship analysis
ETL (Extract, Transform, Load) on unstructured data
Image correlation and analysis
Collaborative filtering


Hadoop What do we expect from it ?


If we analyze the above use cases, we realize that we need support for:
Heterogeneous data from various sources.

Real time processing of incoming data.

Need for a connector to existing RDBMSs

Need for a Distributed File System

Need for data warehouse

Need for a Scalable database

GUI to operate and Develop Applications for Hadoop

Need for a Framework for Parallel Compute

Need for machine learning and data mining libraries

Hadoop Components
HDFS: distributed file system
MapReduce: distributed processing of large data sets
HBase: scalable distributed database, supports structured data
Hive: data warehousing framework
Pig: framework for parallel computation
Mahout: scalable machine learning and data mining library
ZooKeeper: coordination service for distributed applications
Oozie: workflow service to manage data processing jobs
Chukwa: monitoring of large distributed systems
Hue: GUI to operate and develop Hadoop applications
Flume: moves large volumes of data efficiently
Avro: data serialization system
Sqoop: connector to structured databases
and many more.

Hadoop: Who's Using It?

Uses Hadoop and HBase for: social services, structured data storage, processing for internal use.
Uses Hadoop for: Amazon's product search indices; millions of sessions processed daily for analytics.
Uses Hadoop for: search optimization and research.
Uses Hadoop for: internal log reporting/parsing systems designed to "scale to infinity and beyond"; a web-wide analytics platform.
Uses Hadoop as: a source for reporting/analytics and machine learning.
Uses Hadoop for: databasing and analyzing Next Generation Sequencing (NGS) data produced for the Cancer Genome Atlas (TCGA) project and other groups.

And many more.

Hadoop The Various Forms Today


Apache Hadoop: native Hadoop distribution from the Apache Foundation
Yahoo! Hadoop: Hadoop distribution of Yahoo
CDH: Hadoop distribution from Cloudera
Greenplum Hadoop: Hadoop distribution from EMC
HDP: Hadoop platform from Hortonworks
M3 / M5 / M7: Hadoop distributions from MapR
Project Serengeti: VMware's implementation of Hadoop on vCenter
And more

Hadoop and MR Programming

A framework for running applications on large clusters of commodity hardware (around 1000 nodes) to store and process huge volumes of data (petabytes to zettabytes). Open-source Apache Software Foundation project.

Hadoop includes:
HDFS: a distributed filesystem to distribute the data
Map/Reduce: a data-parallel programming model, implemented on top of HDFS

CONCEPT: moving computation is more efficient than moving large data.

Each cluster node runs both DFS and MR.

Hadoop Cluster Architecture:

HDFS Architecture

NameNode: maps (filename, offset) -> block id, and block -> DataNode
DataNode: maps block -> local disk
Secondary NameNode: periodically merges edit logs
A block is also called a chunk.

Dataflow in Hadoop

1. The client submits the job; the scheduler assigns map and reduce tasks to cluster nodes.
2. Each map task reads an input file block from HDFS.
3. A finished map writes its output to the local file system and reports its completion and output location.
4. Reduce tasks fetch the map outputs from the mappers' local file systems via HTTP GET.
5. The reduce tasks write the final answer to HDFS.

4. MR Examples

CALCULATING PI

The area of the square: As = (2r)^2 = 4r^2. The area of the circle: Ac = pi * r^2.
Therefore pi = 4 * Ac / As, which can be estimated by generating random points in the square and counting how many also fall inside the circle:
pi ≈ 4 * (number of points inside the circle) / (number of points generated in the square)
MAP: generate random points in the square and count those that also fall inside the circle.
REDUCE: sum the counts from all mappers and compute the estimate of pi.
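A minimal sketch of this Monte Carlo estimate in map/reduce style (plain Python, not tied to a specific Hadoop API; the function names are illustrative):

```python
import random

def map_task(num_points):
    """Map: generate random points in the unit square and count those
    that also fall inside the quarter circle of radius 1."""
    inside = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, num_points

def reduce_task(partial_counts):
    """Reduce: sum the per-mapper counts and estimate pi = 4 * inside / total."""
    inside = sum(c[0] for c in partial_counts)
    total = sum(c[1] for c in partial_counts)
    return 4.0 * inside / total

# Example: 4 "mappers", 100000 points each
print(reduce_task([map_task(100_000) for _ in range(4)]))
```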

Restricted parallel programming model meant for large clusters


User implements Map() and Reduce()

Ex: WORD COUNT EXAMPLE


Divide the document so that each mapper analyses one line; the reducer sums up all the counts.

Each mapper counts words in 1 line

File:
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Map: for the given sample input, the first map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
The second map emits:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>

The output of the first combine:
< Bye, 1> < Hello, 1> < World, 2>
The output of the second combine:
< Goodbye, 1> < Hadoop, 2> < Hello, 1>
Thus the output of the job (reduce) is:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>

Map()
Input <filename, file text> Parses file and emits <word, count> pairs
eg. <Hello, 1>

Reduce()
Sums all values for the same key and emits <word, TotalCount>
eg. <Hello, (1, 1)> => <Hello, 2>
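A hedged sketch of the word count mapper and reducer in the Hadoop Streaming style (stdin/stdout, tab-separated key and value); in practice these would be two separate scripts passed to the streaming jar, and this is an illustration rather than the Java code usually shown for Hadoop:

```python
#!/usr/bin/env python3
import sys

def mapper():
    """mapper.py: emit <word, 1> for every word on every input line."""
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    """reducer.py: sum the counts per word (input arrives sorted by key)."""
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")
```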

Parallelisation - Hadoop MapReduce

The two lines of the file are processed by two mappers in parallel; their outputs are combined and then merged by the reducers into the final counts shown above.

I work in a company where the Engineering Department produces an amazing amount of CAD (Computer-Aided Design) files. Over the years we ended up having hundreds of thousands, if not millions, of files hosted on different filer systems. Quite often the engineers need to access those files to modify, evolve or consult the information inside. The problem is that even though the engineers know precisely the name of the file they want, it takes quite a while (sometimes more than an hour) for the filer system to actually find it and send it back to the engineer's PC, because no indexing system exists on the filer hosting system (the system tests every single inode until the correct one is found). The files are not very big (a couple of dozen MB), but there are so many of them. Is Hadoop suitable?

5. Genome clustering
A genome is a DNA sequence over the alphabet {A, T, G, C}. Features (motifs, here 3-mers) that characterize a genome are extracted in MR job 1.
Example sequence: AAAGGGTTTCCCAAAG
Mapper: AAA-1, AAG-1, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1, CCC-1, CCA-1, CAA-1, AAA-1, AAG-1
Reducer: AAA-2, AAG-2, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1, CCC-1, CCA-1, CAA-1
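A small sketch of the motif extraction as map and reduce functions (plain Python; the 3-mer window length is taken from the example above, and the function names are illustrative):

```python
from collections import Counter

K = 3  # motif length used in the example

def map_motifs(sequence):
    """Map: emit (k-mer, 1) for every overlapping window of the sequence."""
    for i in range(len(sequence) - K + 1):
        yield sequence[i:i + K], 1

def reduce_motifs(pairs):
    """Reduce: sum the counts per k-mer, giving the feature descriptor."""
    counts = Counter()
    for kmer, c in pairs:
        counts[kmer] += c
    return counts

print(reduce_motifs(map_motifs("AAAGGGTTTCCCAAAG")))
# AAA:2, AAG:2, AGG:1, GGG:1, GGT:1, GTT:1, TTT:1, TTC:1, TCC:1, CCC:1, CCA:1, CAA:1
```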

Pipeline:
1. Read the input sequence
2. Obtain the [1x64] feature descriptor vector (MR job 1)
3. Compare to the existing species by clustering (MR job 2)
4. Identify the species

K-Means
1. Fix the initial cluster centres
2. Find the distance between samples and cluster centres
3. Find which cluster centre is closest to a given sample
4. Add the sample to that cluster
5. Re-evaluate the cluster centres

MapReduce k-means
Map input: key = cluster id, value = (motif, count) feature vector. The map calculates the distance of each point from the cluster centres. Map output: key = cluster id, value = distance to each cluster centre.
Reducer: finds the closest cluster centre for each point and recalculates the new centres for the next iteration. Output: new centres.

MR k-means
Let k be the number of features in a sample, V1(k) the value of the kth feature of the sample, and V2(k) the value of the kth feature of a cluster centre.
Mapper: the distance to the ith cluster centre is the Euclidean distance
Si = sqrt( Σk (V1(k) - V2(k))^2 )
Reducer: min[Si] over all cluster centres i is evaluated for a species; the centroid is then re-evaluated over all features (k) of the samples assigned to each cluster, to update the centroid.
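A minimal sketch of one MapReduce-style k-means iteration over such feature vectors (plain Python with NumPy; the Euclidean distance above is assumed, and the data here is illustrative):

```python
import numpy as np

def kmeans_map(samples, centres):
    """Map: assign each sample to its closest cluster centre."""
    for v1 in samples:
        dists = [np.sqrt(((v1 - v2) ** 2).sum()) for v2 in centres]
        yield int(np.argmin(dists)), v1          # (cluster id, sample)

def kmeans_reduce(assignments, old_centres):
    """Reduce: recompute each centroid as the mean of its assigned samples."""
    new_centres = []
    for cid, old in enumerate(old_centres):
        members = [v for c, v in assignments if c == cid]
        new_centres.append(np.mean(members, axis=0) if members else old)
    return new_centres

# One iteration over 64-dimensional motif-count vectors (toy data):
samples = [np.random.rand(64) for _ in range(10)]
centres = samples[:3]
centres = kmeans_reduce(list(kmeans_map(samples, centres)), centres)
```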

Accuracy comparison: accuracy (percentage) vs. length of input sequence (chart).

Time efficiency: execution time vs. length of input sequence (chart).

HADOOP ON AWS
Cluster layout: one master instance, core instances and task instances.

PERFORMANCE

File size | Uploading time | Computation time | Overall time
15.9 KB | 1 sec | 4 min | 6 min
1 MB | 3 sec | 4 min | 6 min
54.4 MB | 5 min | 6 min | 9 min
118.9 MB | 13 min | 8 min | 13 min

(Chart: MR and total time for the instance-type configurations sss, ssm, ssl, mms.)

Number of instances | Type of instances | Computation time | Overall time
M-1, C-2, T-1 | M-s, C-s, T-s | 8 min | 13 min
M-1, C-2, T-1 | M-m, C-m, T-m | 5 min | 9 min
M-1, C-2, T-1 | M-s, C-s, T-m | 6 min | 11 min
M-1, C-2, T-1 | M-s, C-s, T-l | 7 min | 11 min
(M = master, C = core, T = task; s / m / l = small / medium / large instance types)

Local vs cloud cluster

(Chart: execution time of the local Hadoop cluster vs. AWS for input sizes of 1 KB, 15 KB, 15 MB and 119 MB.)

6. Case Study: Prediction of stock prices


A Support Vector Machine (SVM) is a supervised learning method that analyzes data used for classification and regression analysis

Learning in SVM is done by finding a hyperplane


Separates the training set in a feature space using a kernel function, which is an inner product of input-space features

Binary classification can be viewed as the task of separating classes in feature space:
f(x) = sign(w^T x + b)
Hyperplane: w^T x + b = 0; positive class: w^T x + b > 0; negative class: w^T x + b < 0
where w is the normal vector to the hyperplane, x represents the training patterns, b is the threshold, αi are the coefficients and si are the support vectors.

w = Σi αi si

Methodology

Steps
1. Data collection: download data sets for a month / year / day
2. Input feature selection: MA and RSI; the volume average determines the + / - class of each point
3. Finding the support vectors S1, S2 and the coefficients α1, α2
4. Construction of the hyperplane y = w^T x + b
5. Classification of the test data into positive / negative points using the hyperplane and the X values

2. a) Input feature
Moving Average
MA = (Σ i=1..n CPi) / n, where CPi = closing price on the ith day

Relative Strength Index
RSI = 100 * RS / (1 + RS), where RS = AU / AD
AU = total upward price change during the past n days
AD = total downward price change during the past n days
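A small sketch of the two input features, following the formulas above (plain Python; `closes` and the open-minus-close differences are assumed to be lists of daily values):

```python
def moving_average(closes):
    """MA = sum of closing prices / n."""
    return sum(closes) / len(closes)

def rsi(price_changes):
    """RSI = 100 * RS / (1 + RS), with RS = AU / AD (total up / total down change)."""
    au = sum(d for d in price_changes if d > 0)
    ad = sum(-d for d in price_changes if d < 0)
    rs = au / ad
    return 100 * rs / (1 + rs)

# Values taken from the worked example that follows:
print(moving_average([7.53, 8.08, 8.42, 9.15]))   # ~8.30
print(rsi([0.66, 0.73, 0.78, -0.55]))             # ~79.8
```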

Input Data & Feature Selection Example

Date | High | Low | Open | Close | Volume | Open - Close
26.8.11 | 8.87 | 7.04 | 8.19 | 7.53 | 4.3 | 0.66
19.8.11 | 9.48 | 7.70 | 8.81 | 8.08 | 2.5 | 0.73
12.8.11 | 9.27 | 8.32 | 9.20 | 8.42 | 1.6 | 0.78
5.8.11 | 10.83 | 7.74 | 8.60 | 9.15 | 2.8 | -0.55

RS = (0.66 + 0.73 + 0.78) / 0.55 = 3.94
Y = RSI = 3.94 / (1 + 3.94) * 100 = 79.87
X = MA = 8.3; volume average = 2.8
Total data set volume average = 1.54, so this is a + point (81.61% increase).

2. b) Repeat for the other months

Month | MA | RSI | +/- (volume average)
August | 8.30 | 79.87 | +
June | 8.97 | 85.81 | +
May | 10.73 | 74.09 | +
April | 14.54 | 55.35 | +
March | 13.74 | 45.94 | +
November | 11.79 | 24.24 | +
July | 10.61 | 42.53 | -
February | 12.32 | 31.50 | -
January | 10.44 | 83.97 | -
December | 12.78 | 28.05 | -
October | 8.81 | 31.03 | -

3. Support Vectors
Support vectors are selected using the Euclidean distance. In the Euclidean plane, if X = (x1, x2) and Y = (y1, y2), the distance is D(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 ), where X and Y are the positive and negative points in (Moving Average, Relative Strength Index) space.

+ive point (8.30, 79.87):
  distance to (10.61, 42.53) = 37.44
  distance to (12.32, 31.50) = 48.53
  distance to (10.44, 83.97) = 4.62
  distance to (12.78, 28.05) = 52.01
  distance to (8.81, 31.03) = 48.84

+ive point (8.97, 85.81):
  distance to (10.61, 42.53) = 43.31
  distance to (12.32, 31.50) = 54.41
  distance to (10.44, 83.97) = 2.35 (minimum)
  distance to (12.78, 28.05) = 57.88
  distance to (8.81, 31.03) = 54.77

The distances for the other points are also calculated and the minimum is found:
S1 = (8.97, 85.81), S2 = (10.44, 83.97)
Augmented with a bias component: S1 = (8.97, 85.81, 1), S2 = (10.44, 83.97, 1)
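A sketch of this nearest-pair search (Python with NumPy; the points are the (MA, RSI) pairs from the table above):

```python
import numpy as np

pos = np.array([[8.30, 79.87], [8.97, 85.81]])    # positive (MA, RSI) points shown above
neg = np.array([[10.61, 42.53], [12.32, 31.50], [10.44, 83.97],
                [12.78, 28.05], [8.81, 31.03]])   # negative points

# Euclidean distance between every positive and every negative point
dists = np.sqrt(((pos[:, None, :] - neg[None, :, :]) ** 2).sum(axis=2))
i, j = np.unravel_index(np.argmin(dists), dists.shape)
s1, s2 = pos[i], neg[j]            # closest pair -> support vectors
print(s1, s2, dists[i, j])         # ~ [8.97 85.81] [10.44 83.97] 2.35
```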

3. b) Finding α1 and α2
α1 (S1 · S1) + α2 (S1 · S2) = +1
α1 (S1 · S2) + α2 (S2 · S2) = -1
7444.81 α1 + 7300.11 α2 = +1
7300.11 α1 + 7160.95 α2 = -1
α1 = -0.71; α2 = 0.73
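A hedged sketch of this step: augment the support vectors with a bias component of 1, build their Gram matrix and solve the resulting 2x2 linear system (NumPy; the sign of each α depends on which support vector is assigned the +1 label, so only the magnitudes are asserted here):

```python
import numpy as np

s1 = np.array([8.97, 85.81, 1.0])     # support vectors augmented with a bias term
s2 = np.array([10.44, 83.97, 1.0])

S = np.vstack([s1, s2])
gram = S @ S.T                        # ~[[7444.81, 7300.11], [7300.11, 7160.95]]
labels = np.array([+1.0, -1.0])       # class labels assigned to s1 and s2
alphas = np.linalg.solve(gram, labels)
print(alphas)                         # magnitudes ~0.71 and ~0.73
```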

4. Construction of Hyperplane
The hyperplane is constructed using the equation y = w^T x + b, where w = Σi αi si.
w is the normal vector to the hyperplane, x the training patterns, b the threshold, αi the coefficients (calculated in the previous step) and si the support vectors.

w^T = Σi αi si = α1 s1 + α2 s2 = (1.25, 0.37), b = 0.02
Hyperplane: y = (1.25, 0.37) · x + 0.02

5. Testing set
Test point: MA = x1 = 12.44, RSI = x2 = 50; predict the volume average class.
y = (1.25, 0.37) · (12.44, 50) + 0.02 = 34.05 + 0.02 = 34.07 (predicted value, volume average)
Overall average = 18
Increase = (34.07 - 18) / 18 ≈ 89%
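A sketch reproducing the hyperplane and test-point calculation above (NumPy; the α values are taken from the previous step, and small differences from 34.07 come from rounding w to two decimals on the slide):

```python
import numpy as np

s1 = np.array([8.97, 85.81, 1.0])
s2 = np.array([10.44, 83.97, 1.0])
a1, a2 = -0.71, 0.73                  # coefficients from the previous step

w_aug = a1 * s1 + a2 * s2             # ~[1.25, 0.37, 0.02]
w, b = w_aug[:2], w_aug[2]

x_test = np.array([12.44, 50.0])      # MA and RSI of the test month
y = w @ x_test + b                    # ~34 (the slide rounds w and gets 34.07)
print(w, b, y)

overall_avg = 18.0
print((y - overall_avg) / overall_avg * 100)   # ~89-90% increase
```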

Multi SVM
Support vector machine is a powerful tool for binary classification, capable of generating very fast classifier functions following a training period

To extend the binary-class scenario to a multi-class scenario, decompose an M-class problem into a series of two-class problems

Approaches for Multi class SVM


Multiclass ranking SVMs, in which one SVM decision function attempts to classify all classes
One-against-all classification, in which there is one binary SVM for each class to separate members of that class from members of all other classes
Pairwise classification, in which there is one binary SVM for each pair of classes to separate members of one class from members of the other
Binary decision tree (BDT) classification
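For illustration only, scikit-learn (if available) exposes the one-against-all and pairwise strategies on top of its binary SVM; this is a generic sketch with toy data, not the exact setup used in the case study:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(40, 2)                  # toy (MA, RSI)-style features
y = np.random.randint(0, 4, size=40)       # four classes, e.g. the ranges below

ovo = SVC(kernel="linear", decision_function_shape="ovo")   # pairwise (one-vs-one)
ova = OneVsRestClassifier(SVC(kernel="linear"))             # one-against-all
ovo.fit(X, y)
ova.fit(X, y)
print(ovo.predict(X[:5]), ova.predict(X[:5]))
```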

Example classes: -100 to -50, -50 to 0, 0 to 50, 50 to 100

One-against-all / pairwise: binary SVMs are built directly over the four classes (-100 to -50, -50 to 0, 0 to 50, 50 to 100).

Binary decision tree (BDT): the root (-100 to 100) splits into -100 to 0 and 0 to 100, which in turn split into the four leaf classes -100 to -50, -50 to 0, 0 to 50 and 50 to 100.

7. Conclusion - Big Data Is About


Tapping into diverse data sets
Finding and monetizing unknown relationships
Data-driven business decisions

Big Data in Action

The cycle: ACQUIRE -> ORGANIZE -> ANALYZE -> DECIDE
Acquire all available data.
Organize and distill big data using massive parallelism.
Analyze all your data at once.
Decide based on real-time big data.
Make better decisions using big data.

Summary
Big data NoSQL Hadoop Case Study

Questions
