
3/24/2013

Big Data
Jason P. Albert
University of Pennsylvania
jasonalb@wharton.upenn.edu

Big Data

PERSPECTIVES

What is Big Data?


"High-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." (Gartner)

1 Terabyte = 1,024 Gigabytes
1 Petabyte = 1,024 Terabytes
1 Exabyte = 1,024 Petabytes
1 Zettabyte = 1,024 Exabytes

1 ZB = 1,099,511,627,776 GB, so 7.9 ZB = 8,686,141,859,430 GB
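The conversions above can be verified with a few lines of Python (binary, base-1024 units, as used in the figures on this slide):

```python
# Binary (base-1024) storage units.
GB = 1
TB = 1024 * GB
PB = 1024 * TB
EB = 1024 * PB
ZB = 1024 * EB  # 1 ZB = 1,099,511,627,776 GB

print(f"1 ZB = {ZB:,} GB")              # 1 ZB = 1,099,511,627,776 GB
print(f"7.9 ZB = {7.9 * ZB:,.0f} GB")   # 7.9 ZB = 8,686,141,859,430 GB
```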



How do we handle Big Data?


MAD Information Management is the approach:
Magnetic: attracts all data sources
Agile: easily accommodates data at a rapid pace
Deep: provides sophisticated statistical methods over its data repository
Why is MAD a departure from the traditional Data Warehouse?


What is the Scope of the Solution?


An End to End Solution must be Considered:

Consume: Volume, Velocity, Variety
Store: Gigabytes, Terabytes, Petabytes
Process: Cluster, Classify, Predict
Present: Visualize, Interact, Evaluate


Perspectives on Big Data


Does it handle Big Data? Volume, Velocity, Variety
Is it considered MAD? Magnetic, Agile, Deep
Is it an End-to-End Solution? Consume, Store, Process, Present

Options to Consider
Two promising options with low market penetration (Gartner):
MapReduce and alternatives
In-Memory Computing


Big Data

MAP REDUCE

Hadoop = MapReduce + HDFS


An open-source, batch-oriented, data-intensive, general-purpose framework for creating distributed applications that process big data (i.e., Volume, Velocity, Variety).

Hadoop Distributed File System (HDFS):
Data distributed and replicated over multiple systems
Block oriented

MapReduce:
Map function produces intermediate key/value pairs
Reduce function merges intermediate values
Facilitates parallel processing of multiple terabytes of data on large clusters of commodity platforms

Scale out on low-cost hardware: fully depreciated, repurposed machines
Jason P. Albert, University of Pennsylvania
jasonalb@wharton.upenn.edu

MapReduce Workflow
$ hadoop jar wordcount.jar WordCount /usr/input /usr/output

1. Input data is distributed
2. Map tasks work on a split of data
3. Mapper outputs intermediate data
4. Data is exchanged between nodes
5. Intermediate data of the same key goes to the same reducer
6. Reducer output is stored

Map(key, value):
    for each word x in value:
        output.collect(x, 1)

Reduce(keyword, listOfValues):
    for each x in listOfValues:
        sum += x
    output.collect(keyword, sum)

Example input: Jack be nimble, Jack be quick, Jack jump over the candlestick.

Input splits:
(0, "Jack be nimble,") (15, "Jack be quick,") (28, "Jack jump over the candlestick.")

Map output:
(Jack, 1), (be, 1), (nimble,, 1), (Jack, 1), (be, 1), (quick,, 1), (Jack, 1), (jump, 1), (over, 1), (the, 1), (candlestick., 1)

Shuffled to reducers:
(Jack, (1, 1, 1)), (be, (1, 1)), (nimble,, (1)), (quick,, (1)), (jump, (1)), (over, (1)), (the, (1)), (candlestick., (1))

Reduce output:
(Jack, 3), (be, 2), (nimble,, 1), (quick,, 1), (jump, 1), (over, 1), (the, 1), (candlestick., 1)
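The same word count can be sketched in plain Python to show the map, shuffle, and reduce phases. This is a minimal single-process illustration, not the Hadoop API; the function names are chosen here for clarity:

```python
from collections import defaultdict

def map_phase(value):
    """Emit (word, 1) for each whitespace-separated token, as the Map pseudocode does."""
    return [(word, 1) for word in value.split()]

def shuffle(pairs):
    """Group intermediate values by key, so each key goes to one reducer."""
    grouped = defaultdict(list)
    for key, val in pairs:
        grouped[key].append(val)
    return grouped

def reduce_phase(key, values):
    """Sum the counts for one key, as the Reduce pseudocode does."""
    return (key, sum(values))

text = "Jack be nimble, Jack be quick, Jack jump over the candlestick."
intermediate = map_phase(text)
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["Jack"])  # 3
```

Note that, as in the slide's example, punctuation stays attached to tokens ("nimble," counts separately from "nimble"); a real job would normalize tokens first.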


Scale-Out: MapReduce + HDFS


Case Study: Recommendations


1) 9 TB of W3C Extended Log File Format data

2) MapReduce program: sessionExtractor


Session             Person      Person
SDF92MGSLOK4M23K    B041Q3EV    N23KFMWE
ASD90K23MOLFWQIE    EM9IU67Y

Example applications:
LinkedIn "People You May Know"
Application Behavior Analytics
Risk & Fraud Analysis
Social Network "Connectedness"
Text Analysis
Regressions (Financial)

Supplemental Case Study


Product Sentiment Analysis over Time:
One month of Twitter feeds and opinion boards loaded onto HDFS
Process using the Word Count example to tally positive and negative words associated with a product over time

This type of analysis is being done with some success
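As a sketch, the word-count approach can be adapted to tally positive and negative terms per day. The word lists and tweet format below are illustrative assumptions, not part of the studies cited:

```python
from collections import Counter

# Illustrative word lists; a real analysis would use a curated sentiment lexicon.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "broken", "awful"}

def sentiment_counts(tweets):
    """Count positive and negative words per day from (day, text) pairs."""
    daily = {}
    for day, text in tweets:
        counts = daily.setdefault(day, Counter())
        for word in text.lower().split():
            if word in POSITIVE:
                counts["positive"] += 1
            elif word in NEGATIVE:
                counts["negative"] += 1
    return daily

tweets = [("2013-03-01", "I love this phone, great battery"),
          ("2013-03-02", "screen broken again, awful support")]
daily = sentiment_counts(tweets)
print(daily)
```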


http://techcrunch.com/2012/05/18/study-twitter-sentiment-mirrored-facebooks-stock-price-today/
http://www.cs.ucr.edu/~vagelis/publications/wsdm2012-microblog-financial.pdf

MapReduce is Different
MapReduce handles processing differently:
Distributed programming
Fault tolerant

MapReduce handles modeling differently:
Schema-less
Oriented toward exploration and discovery

MapReduce handles data differently:
Mostly unstructured data objects
Vast number of attributes and data sources
Data sources added and/or updated frequently
External references
Quality is unknown

http://developer.yahoo.com/hadoop/
http://code.google.com/edu/parallel/mapreduce-tutorial.html

MapReduce
Does it handle Big Data?
Is it considered MAD? Magnetic; Agile (MapReduce requires algorithm development); Deep
Is it an End-to-End Solution?


Big Data

IN-MEMORY COMPUTING

In-Memory Computing
Overview
All relevant structured data held in-memory
Cache-aware memory organization (the current bottleneck is between CPU and main memory)
Data partitioning for parallel execution
[Diagram] Current methodology: computation sits in the application stack, and the database stack is optimized for disk access on platforms with limited main memory and slow disk I/O. Future methodology: leverage current innovations in hardware and software to move computation into the database.

In-Memory Workflow
In-memory computing applies a combination of:
Optimization: query pruning and data distribution
Execution: SQL statement plan for computational parallelization
Stores: column store with partitioning/compression (5-30x ratio)
Persistence: temporal tables and MVCC
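The compression ratio quoted above comes largely from columnar layout: all values of one attribute sit together, so simple schemes such as run-length encoding compress well. A minimal sketch, illustrative only and not any vendor's actual encoding:

```python
def rle_encode(column):
    """Run-length encode one column: consecutive repeats become [value, count] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

# A column store keeps attribute values adjacent, so repeats collapse into short runs.
region = ["EU", "EU", "EU", "US", "US", "US", "US", "APJ"]
print(rle_encode(region))  # [['EU', 3], ['US', 4], ['APJ', 1]]
```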

Example platform: IBM x3850 X5
QPI scaling or MAX5 memory tray
2, 3, or 4 TB RAM
2-4 CPUs @ 10 cores each
> 4 TB @ 8x HDD

http://ark.intel.com/

Scale-Out Strategy for In-Memory


Capturing and Presenting


Data Provisioning
IM-DBMS does not currently accommodate transactional workloads
Trigger replication: new transactions replicate to an in-memory DB (e.g., SAP HANA), facilitating real-time operational analysis, planning, and simulation
Extraction: ETL (Extract, Transform, Load) tools, with support for a large variety of external and internal source systems, handle other data sources in near real time but require job scheduling
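Trigger-style replication can be pictured as a change log applied continuously to an in-memory copy. This is a toy illustration of the concept only; real products use dedicated replication servers:

```python
# Source "database": a list of committed transactions (the change log a trigger feeds).
change_log = []

def commit(txn_id, row):
    """Simulated OLTP commit; a trigger appends each change to the log."""
    change_log.append((txn_id, row))

in_memory_copy = {}

def replicate(last_applied=0):
    """Apply any log entries newer than the last applied transaction id."""
    for txn_id, row in change_log:
        if txn_id > last_applied:
            in_memory_copy[row["id"]] = row
            last_applied = txn_id
    return last_applied

commit(1, {"id": "A", "qty": 5})
commit(2, {"id": "B", "qty": 3})
pos = replicate()
print(len(in_memory_copy))  # 2
```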


Case Study: Sales Analysis


1) Load 1.1 billion PoS records: < 1 sec
2) Identify top-selling categories
3) Drill down into a category: < 1 sec
4) Plan/Actuals as schema & visualize

Link to Video: PoS from HANA using Business Objects Explorer


Examples of Performance Gains


Report on Product Dimensions:
120 million line items
Standard ERP solution: several minutes on a pre-aggregated dataset; more for drilldown
In-Memory: less than 1 second on line-item-level data; minute delay for drilldown

Genome Analysis:
Optimized Data Warehouse: sequence alignment 81 minutes + variant calling 65 minutes
In-Memory: sequence alignment 15 minutes + variant calling 19.5 minutes (6.5 min estimated)
Approximately 2 hours saved
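The arithmetic behind the quoted genome-analysis savings, for reference:

```python
warehouse = 81 + 65    # minutes: sequence alignment + variant calling
in_memory = 15 + 19.5  # minutes: sequence alignment + variant calling
savings = warehouse - in_memory
print(savings)         # 111.5 minutes, i.e. roughly two hours
```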

In-Memory Computing
Does it handle Big Data?
Is it considered MAD? Magnetic (unstructured data still requires pre-processing); Agile; Deep (unsupervised and supervised)
Is it an End-to-End Solution?


Big Data

HDFS + MAP REDUCE + IN MEMORY



Case Study: Recommendations


1) 9 TB of W3C Extended Log File Format data

2) MapReduce program: sessionExtractor


Session             Product     Product
SDF92MGSLOK4M23K    B041Q3EV    N23KFMWE
ASD90K23MOLFWQIE    EM9IU67Y

18M Records

Hadoop-HANA Connector

Scale-Out: MapReduce + HDFS


Recall this slide as the Foundation


+ Case Study: Predictive Analysis


1) Add connection details to the Data Reader component
2) Retrieve records
3) Join 1.1B PoS records to session data
4) K-Means cluster of sessions; explore the outcome
5) Write back to the database for persistence
6) Use to provide recommendations for future website visitors
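The K-Means clustering step can be sketched in a few lines of plain Python. This is an illustrative toy on made-up session features, not the actual tool chain used in the case study:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-Means: assign each point to its nearest centroid, then re-center."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep old one if empty).
        centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Toy session features: (pages viewed, minutes on site).
sessions = [(2, 1), (3, 2), (2, 2), (20, 15), (22, 14), (21, 16)]
centroids, clusters = kmeans(sessions, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

The two well-separated groups (casual vs. engaged sessions) end up in separate clusters, which is the signal the recommendation step then exploits.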

Scale-Out Strategy for In-Memory


Recall this slide as the Foundation


Better together
Does it handle Big Data?
MapReduce enables Magnetism: preprocesses unstructured data
In-Memory enables Agility: data provisioning via replication and extraction
Both MapReduce and In-Memory enable Deep analysis: during MapReduce preprocessing, and via unsupervised & supervised methods for In-Memory
Is it an End-to-End Solution?


SAP HANA + Intel Distribution of Hadoop


Announced February 27, 2013:
http://www.sap.com/corporate-en/news.epx?PressID=20498


MAD Improvement Focus


Transformative potential in five domains:
U.S. Healthcare
E.U. Public Sector administration
Retail
Manufacturing
Personal location data

Most significant constraint: shortage of talent to take advantage of the insights gained from large datasets
Deep analytical talent with technical skills in statistics to provide insights
Data-savvy analysts to interpret, challenge, and base decisions on results
Support personnel who develop, implement, and maintain the architecture

Source: "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute

Big Data

QUESTIONS?