AnilVasudeva NextGen Storage Big Data

NextGen Infrastructure for Big Data
Anil Vasudeva, President & Chief Analyst, IMEX Research

Author: Anil Vasudeva, IMEX Research
2012 Storage Networking Industry Association. All Rights Reserved. Source: IMEX Research Big Data Industry Report
2011-12
SNIA Legal Notice
The material contained in this tutorial is copyrighted by the SNIA and author unless
otherwise noted.
Member companies and individual members may use this material in presentations
and literature under the following conditions:
Any slide or slides used must be reproduced in their entirety without
modification
The SNIA must be acknowledged as the source of any material used in the
body of any document containing material from these presentations.
This presentation is a project of the SNIA Education Committee.
Neither the author nor the presenter is an attorney and nothing in this
presentation is intended to be, or should be construed as legal advice or an opinion
of counsel. If you need legal advice or a legal opinion please contact your attorney.
The information presented herein represents the author's personal opinion and
current understanding of the relevant issues involved. The author, the presenter,
and the SNIA do not assume any responsibility or liability for damages arising out of
any reliance on or use of this information.
NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

Abstract
This session will appeal to Business Planning, Marketing, Technology System Integrators and Data Center
Managers seeking to understand the drivers behind the demand for and rise of Big Data.
The internet has spawned an explosion in data growth in the form of data sets, called Big Data, that are
so large they are difficult to store, manage and analyze using traditional RDBMS, which are tuned for
Online Transaction Processing (OLTP) only. Not only is this new data heavily unstructured, voluminous,
rapidly streaming and difficult to harness, but even more importantly, the infrastructure cost of the HW and
SW required to crunch it using traditional RDBMS, to derive any analytics or business intelligence online
(OLAP) from it, is prohibitive.
To capitalize on the Big Data trend, a new breed of Big Data technologies (such as Hadoop and others)
and many companies have emerged. They leverage new parallelized processing, commodity hardware,
open source software and tools to capture and analyze these new data sets and provide a
price/performance that is 10 times better than existing Database/Data Warehousing/Business Intelligence
systems.
Learning Objectives
The presentation illustrates the operational challenges businesses face today using RDBMS
systems, despite using fast-access in-memory and solid state storage technologies. It details how IT is
harnessing the emergent Big Data to manage massive amounts of data, and new techniques such as
parallelization and virtualization to solve complex problems, in order to empower businesses with
knowledgeable decision-making.
It lays out the rapidly evolving big data technology ecosystem: the different big data technologies, from
Hadoop and Distributed File Systems to emerging NoSQL derivatives, for implementation in private and hybrid
cloud-based environments, and the Storage Infrastructure Requirements to Store, Access, Secure and Prepare
data for analytics and visualization while manipulating it rapidly to derive business intelligence online and run
businesses smartly.
Big Data in IT Industry Roadmap
Cloudization: On-Premises > Private Clouds > Public Clouds. DC to Cloud-Aware Infrast. & Apps. Cascade migration to SPs/Public Clouds.
Integration/Consolidation: Integrate Physical Infrast./Blades to meet CAPSIMS (Cost, Availability, Performance, Scalability, Inter-operability, Manageability & Security)
Standardization: Standard IT Infrastructure, Volume Economics HW/Syst SW (Servers, Storage, Networking Devices, System Software (OS, MW & Data Mgmt. SW))
Virtualization: Pools Resources. Provisions, Optimizes, Monitors. Shuffles Resources to optimize Delivery of various Business Services
Automation: Automatically Maintains Application SLAs (Self-Configuration, Self-Healing, Self-Acctg. Charges etc.)
Analytics BI: Predictive Analytics - Unstructured Data. From Dashboards Visualization to Prediction Engines using Big Data.
NextGen IT Infrastructure
[Diagram labels: Home Networks; Web 2.0 Social Ntwks. (Facebook, Twitter, YouTube); Cable/DSL, Cellular, Wireless; Internet ISPs with Core, Optical and Edge; Supplier/Partners; Remote/Branch Office; Public CloudCenter (Servers, VPN, IaaS, PaaS, SaaS, Vertical Clouds); Enterprise VZ Data Center / On-Premise Cloud]
[Data center tiers: Tier-1 Edge Apps (Caching, Proxy, FW, SSL, IDS, DNS, LB, Web Servers); Tier-2 Apps (Application Servers; HA, File/Print, ERP, SCM, CRM Servers); Tier-3 Data Base Servers (Database Servers, Middleware, Data Mgmt.); Switches: Layer 4-7, Layer 2, 10GbE, FC Stg.; FC/IP SANs; Management, Directory, Security, Policy; Middleware Platform]
A request for data from a remote client to a Data Center or Cloud crosses a myriad of
systems and devices. The key is identifying bottlenecks & improving performance.
Harnessing Big Data for Business Insights
The majority of data growth is being
driven by unstructured data
and billions of large objects
Information is at the center of a
New Wave of opportunity
80% of the world's data is unstructured,
driven by the rise in mobility devices,
collaboration and machine-generated data.
Data Sources > Big Data Infrastructure > Business Insights
Unstructured Big Data can provide Next
Gen Analytics to help businesses make
better-informed decisions in:
Product Strategy
Targeting Sales
Just-In-Time Supply-Chain Economics
Business Performance Optimization
Predictive Analytics &
Recommendations
Country Resources Management
Corporate Need: Business Perf... Optimization
Corporate Need: Real Time Analytics
Corporate Need: Business Insights

Item | Issue | Solution
Store | Information Exploding. Volume: Digital Content doubling every 18 months. Velocity: >80% growth driven from unstructured data. Variety: sources of data changing | A unified information/content storage methodology that enables users to manage the volume, velocity and variety of information from multiple sources
Manage | Complexity in "managing" information. Need to classify, synchronize, aggregate, integrate, share, transform, profile, move, cleanse, protect, retire | A solution portfolio of tools and services to manage all types of information in a hybrid storage environment
Analyze | Current solutions limited to BI tools focused on structured and lagging information | Build/buy packaged Real-Time Predictive Analytical Solutions for unstructured analytics tools
Collaborate | Multiple access methods needed to meet needs of a diverse audience. | Centralized share, collaborate and act on insights anytime, anywhere on any device.
Model/Adapt | Ability to understand how the information impacts the business. How to transfer to action. | Model Information on current operations w/potential strategy impact. Leverage Tech. to adapt.
Opportunity: Converting Big Data Deluge
into Predictive Analytics & Insights
Personal Location Services Data Generated by | TB/Year
Navigation Devices | 600
Navigation Apps on Phones | 20
Smart Phone Opt-In Tracking | 1000
Geo Targeted Ads | 20
People Locator (Emergency Calls/Search..) | 10
Location based Services (e.g. Games) | 5
Other | 45
Total (Est.) | 1700
Big Data Predictive Analytics
Issues with Existing RDBMS
Key Issues with RDBMS Technologies

Handling Mixed Unstructured Data
- RDBMS don't handle non-tabular data
(notorious for doing a poor job on recursive data structures)
Legacy Archaic Architecture
- RDBMS don't parallelize well to accommodate
commodity HW clusters
Speed
- Seek time of physical storage has not kept pace with
network speed improvements
Scale
- Difficult to scale out RDBMS efficiently; clustering
beyond a few servers is notoriously hard
Integration
- Data processing tasks need to combine data from
unrelated sources over a network
Volume
- Data volumes have grown from 10s of GB to 100s of TB to
PBs in recent years. Existing tabular RDBMS can't
handle such large DBs
[Diagram: Big Data Analytics, surrounded by Fault Tolerance, Availability, Deep Insights, EU Adhoc Analytics, Cost, Unstructured Data, Latency, Scalability]
Issues with Existing RDBMS
Present RDBMS struggling to Store & Analyze Big Data
Big Data - Database Solutions
Big Data The New Face of DBs

Big Data Paradigm - The New face of DB Systems
Adopts Schema-Free Architecture
Can do away with Legacy Relational DB Systems
Some data have sparse attributes and do not need relational properties
Key-Oriented Queries
Some data are stored/retrieved mainly by primary key, w/o complex joins
Trade-off of Consistency, Availability & Partition Tolerance
Scale Out, not Up: online load balancing, cluster growth
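The schema-free, key-oriented access pattern listed above can be sketched in a few lines. This is a toy illustration only, not any particular product: the class name ShardedKeyValueStore, the shard count and the MD5-modulo placement are all assumptions made for the example.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy schema-free store: records are plain dicts located only by primary key."""

    def __init__(self, num_shards):
        # Each shard stands in for one node of a scale-out cluster (assumption).
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Hash the key so records spread evenly across shards.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, record):
        self.shards[self._shard_for(key)][key] = record

    def get(self, key):
        # Lookup touches exactly one shard; no joins are involved.
        return self.shards[self._shard_for(key)].get(key)


store = ShardedKeyValueStore(num_shards=4)
# Records need not share the same attributes (sparse, schema-free).
store.put("user:1", {"name": "Ann", "city": "San Jose"})
store.put("user:2", {"name": "Raj", "devices": ["phone", "tablet"]})
```

Plain modulo placement reshuffles most keys when the shard count changes, so a system that grows its cluster online would typically use something like consistent hashing instead.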
Analytics The Next Frontier in IT
Key Innovations: HW Technologies
Key Innovations Solid State Storage
Source: IMEX Research SSD Industry Report 2011
Note: 2U storage rack, 2.5" HDD max cap = 400GB / 24 HDDs, de-stroked to 20%; 2.5" SSD max cap = 800GB / 36 SSDs
[Chart: HDD and SSD Units (Millions), 2009-2014, and IOPS/GB for HDDs vs. SSDs]
Random IOPS are key to database performance. SSDs outshine HDDs in I/O price/performance,
a major reason, besides better space and power, for their explosive growth.
Storage - IOPS/GB & Price Erosion - HDD vs. SSDs
Innovations DB SW Technologies
Tech Innovation timeline, 1985 to 2015
OLTP Transactions DB SW: Rows Locking, Optimizer, Parallel Query, Clustering, XML, Grid, Open Source/Hadoop
OLAP Analytics DB SW: Indexing, Partitioning, Columnar, Materialized View, Bit Mapped Index, In-Memory, Query Binding
Hardware: 32 bit, SMP, NUMA, 64 bit, Multi-core/Blades, Flash, MPP
Big Data: Multi-core, Columnar, In-Memory, MPP, Visualization
[Chart: OLTP Database Innovation Progress, 1985-2015; $/TPMc and TPMc/Processor on log scales]
[Chart: Big Data: Analytics DB Technology Impact, 1985-2015; DW Size (TB) and $/GB on log scales]
Big Data - Key Requirements
Types of Data Organizations Analyze
Data type | Hadoop | Non Hadoop
Customer/member data | 59% | 68%
Transactional data from applications | 44% | 68%
Application Logs | 69% | 37%
Other Types of Event Data | 64% | 23%
Network Monitoring/Network Traffic | 41% | 33%
Online Retail Transactions | 33% | 34%
Other Log Files | 51% | 26%
Call Data Records | 28% | 32%
Web Logs | 46% | 21%
Text data from social media and online | 36% | 15%
Search logs | 36% | 11%
Trade/quote data | 18% | 15%
Intelligence/defense data | 18% | 11%
Multimedia (audio/video/images) | 21% | 9%
Weather | 8% | 3%
Smartmeter data | 3% | 6%
Other (please specify) | 3% | 5%
Big Data Architectural Goals
Big Data
Platform
Meet Enterprise Criteria
Meet Requirements of V3 (Volume, Velocity, Variety)
Analyze Data in Native Format
Big Data - Market Requirements
Unified system: Pre-integrated for Ease of Installation and Management
Platform: Large Scale Indexing, Pre-integrated using Hadoop Foundation
Integrated Text Analytics: Address Unstructured Data
Usability: User Friendly Admin Console including HDFS Explorer, Query Languages
Enterprise Class Features: Provisioning, Storage, Scheduler, Advanced Security
Supports search-centric, document-based XML data model
o Store documents within a transactional repository.
Schema-Free:
o No advance knowledge of the document structure (its "schema") needed
o Index words and values from each of the loaded documents together with its document structure.
Standard commodity hardware leveraged
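The schema-free indexing requirement above (index words and values together with the document structure, with no advance knowledge of the schema) might look roughly like the sketch below. It is a minimal illustration under stated assumptions: documents are modeled as nested Python dicts and lists rather than XML, and the helper names iter_values and build_index are hypothetical.

```python
import re
from collections import defaultdict


def iter_values(doc, path=""):
    """Walk a nested document, yielding (structure path, leaf value as text)."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from iter_values(value, f"{path}/{key}")
    elif isinstance(doc, list):
        for item in doc:
            yield from iter_values(item, path)
    else:
        yield path, str(doc)


def build_index(docs):
    """Map each word to the set of (document id, structure path) where it occurs."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for path, value in iter_values(doc):
            for word in re.findall(r"\w+", value.lower()):
                index[word].add((doc_id, path))
    return index


# Two documents with different shapes; no schema is declared before loading.
docs = {
    "d1": {"title": "Storage report", "author": {"name": "Anil"}},
    "d2": {"log": ["disk failure", "storage full"]},
}
index = build_index(docs)
```

Looking up index["storage"] then returns both documents along with the element path in which the word was found, which is what lets a search-centric store answer structure-aware queries without a predefined schema.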
Big Data Market Requirements
Architectural
Shared-nothing clustered DB architecture
o Programmable and extensible application servers.
Support massive scalability to petabytes of source data
Support open-source XQuery- and XSLT-driven architecture
Simple to Deploy, Develop and Manage (UI & RESTful Interface)
Support extreme mixed workloads: a wide variety of data types including
arbitrarily hierarchical data structures, images, waveforms, data logs etc.
Support thousands of geographically dispersed online users and
programs executing a variety of requests, from ad hoc queries to strategic analysis
Load data before declaring or discovering its structure
Load data in batch and streaming fashion
Integrate data from multiple sources during load process at very high rates
Spread I/O and data across instances
Provide consistent performance with linear cost
Leverage Open Source SW: Low Costs, Multiple Sources, Hadoop Foundation Tools
Connectivity with Oracle DB, Teradata Warehouse, JDBC Connectivity
Real Time Analytics Execution
Execute streaming analytic queries in real time on incoming load data
Update data in place at full load speeds
Schedule and execute complex multi-hundred node workflows
Join a billion row dimension table to a trillion row fact table without
pre-clustering the dimension table with the fact table
Performance
Analyze data at very high rates (>GB/sec)
Predictable sub-ms response time for highly constrained standard SQL queries
Availability
Ability to configure without any single point of failure
Auto-Failover: Extreme High Availability
o Automated failover and process continuation without operational
interruption when processing nodes fail
Big Data Product Metrics Choices
Big Data - Product Metrics
Data Set Size: PB, TB, GB
Data Structure: Transaction, Machine, Unstructured, Other
Access/Use: Transaction, Search, Analytics
Parallel Processing: Appliance, Cluster < 1K, Cluster > 1K
Memory: In-Memory, Flash
DB Technique: Columnar, Zero Sharing, No SQL
Data Cataloging SW: Text, Image, Audio, Video
Advantage: Big Data Products
Characteristic | Legacy Paradigm | Big Data Paradigm
Structure | Transactional/Corporate | Unstructured/Derivative/Internet
Mode | Data Collection | Data Analysis
Focus | Find Answers | Find Questions
Facility | Reportive / What Happened? | Analytic / Why did it Happen? Predictive / What will Happen Next?
Opportunity | Very Small Growth | Massive Growth
Players | Legacy Players | Agile Start Ups, well funded
Impact | Analyze Existing Businesses | Create New Businesses
Advantage: Big Data Products
Characteristic | Traditional RDBMS | Big Data/MapReduce
Data Size | GB | PB
Access | Interactive | Batch/Near Real-Time
Latency | Low | High
Data Updates | Read & Write Many Times | Write Once, Read Many Times
Schema/Structure | Static Schema | Dynamic Schema
Language | SQL | UQL/Procedural (Java, C++..)
Integrity | High | Not 100%
Works Well for | Process Intensive Jobs | Data Intensive Jobs
Works Well w Data Size | Gigabytes | Petabytes
Data/Processing Interactions | Low Latency/High BW precursor to success. Ntwk. BW can be a bottleneck causing nodes to be idle | Sends Code to Data, instead of Sending Data to other Nodes (Requiring Lower BW in Cluster)
Fault Tolerance | Coordinating Processes with Node Failures a challenge | Fault Tolerant for HW/SW Failures
Scaling | Non-linear | Linear
Pgm-Distribution of Jobs | Difficult | Simple & Effective
Big Data Ecosystem
Big Data Stack
[Diagram labels: Data Sources (Databases, Applications, Devices, Streaming Data); Collector; Data Importers; Data Store (Hadoop); Analytics Platform (Hive Workflow, RHive, Oracle-to-Hive, Search, REST/JSON API, Data exporter); Admin Center (Enterprise PerfMon, Query Plan); Advanced Analytics, OLAP Server, OLTP Server, ETL, Ad-hoc query, Real-time Queries; DBA; Oracle, IBM, Teradata]
Merging Hadoop innovations into Nextgen DBMS
Hadoop's Fit in Enterprise Stack
Big Data - Hadoop Architecture
Programming Languages: Data Flow (Pig), SQL (Hive)
Computations: Distributed Computing Framework (MapReduce)
Tabular Storage: Metadata (HCatalog), Column-Stg (HBase)
Object Storage: Hadoop Distributed File System (HDFS)
Across all layers: Coordination (Zookeeper), Management (HMS)
Big Data Connectors to EDW/BI
[Diagram: layered stack]
Servers, Operating System, Hypervisor/VMs, Network
Big Data Storage Framework (HDFS)
Big Data Processing Framework (MapReduce)
Big Data Access Framework: Pig, Hive, Sqoop
Big Data Orchestration Framework: HBase, Avro, Flume, ZooKeeper
Big Data (Connectors)
BI Applications (Query, Analytics, Reporting, Statistics)
EDW
Across all layers: Backup & Recovery, Management, Security
BI Framework - Interoperable with Enterprise Data Warehousing
Big Data Infrastructure Map Reduce
Map Reduce
A Distributed Computing Model
Typical Pipeline:
Input > Map > Shuffle/Sort > Reduce > Output
Easy to use: the developer writes a few functions
Moves compute to data
Schedules work on the HDFS node holding the data
Scans through data, reducing seeks
Automatic reliability and re-execution on failure
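The Input > Map > Shuffle/Sort > Reduce > Output pipeline above can be imitated on a single machine to show what "the developer writes a few functions" means. This is a minimal in-memory sketch using word counting as the job; it is not the Hadoop API, and the names map_phase, shuffle_sort, reduce_phase and run_job are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_sort(pairs):
    """Shuffle/Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Drive the pipeline: Input > Map > Shuffle/Sort > Reduce > Output."""
    mapped = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle_sort(mapped))


counts = run_job(["big data big insights", "data moves to compute"])
```

Only map_phase and reduce_phase are job-specific; in a real cluster the framework would run the map step on the nodes that hold the input blocks and handle the shuffle, scheduling and re-execution itself.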
Big Data Infrastructure - HDFS
HDFS Architecture
Name Node: Namespace State and Block Map; Persistent Namespace Metadata & Journal (NFS)
Hierarchical Namespace: File name >> Block IDs >> Block Locations
Data Nodes (JBOD), each holding blocks (b1, b2, b3): Block ID >> Data; Horizontally Scale I/O & Storage
Heartbeats & Block Reports
Actively Maintaining High Availability: 0. Bad/Lost Block (Periodically Check Block Checksums), 1. Replicate, 2. Copy, 3. Block Received
HDFS
Immutable File System: Read, Write, Sync/Flush; no random writes
Storage Server used for Computation: Move Computation to Data
Fault Tolerant & Easy Management: Built-In Redundancy, Tolerates Disk & Node Failure, Auto-Managing
addition/removal of nodes, One operator/8K nodes
Not a SAN, but high-bandwidth network access to data via Ethernet
Typically used to solve problems not feasible with traditional systems: Large Storage Capacity >100PB raw,
Large IO/computational BW >4K node/cluster, scale by adding commodity HW, Cost ~$1.5/GB incl. MR cluster
Hadoop Distributed File System
HDFS Architecture
HDFS Characteristics
Based on Google GFS (Google File System)
Redundant storage for massive amounts of data
Data is distributed across all nodes at load time for efficient MapReduce processing
Runs on commodity hardware; assumes a high failure rate for components
Works well with lots of large files
Built around write once, read many times
Large streaming reads, not random access
High throughput more important than low latency
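Because HDFS assumes component failures, the name node has to notice blocks that have lost replicas and schedule new copies. The sketch below illustrates that bookkeeping only in outline; it is not Hadoop code. The function names, the node names and the replication target of 3 are assumptions for the example.

```python
def find_under_replicated(block_map, live_nodes, target=3):
    """Return {block_id: surviving replica nodes} for blocks below the target."""
    result = {}
    for block_id, nodes in block_map.items():
        healthy = nodes & live_nodes
        if len(healthy) < target:
            result[block_id] = healthy
    return result


def plan_replication(block_map, live_nodes, target=3):
    """Build (block, source node, destination node) copy instructions."""
    plan = []
    pending = find_under_replicated(block_map, live_nodes, target)
    for block_id, healthy in sorted(pending.items()):
        if not healthy:
            continue  # no surviving copy to replicate from
        source = sorted(healthy)[0]
        candidates = sorted(live_nodes - healthy)
        needed = target - len(healthy)
        for destination in candidates[:needed]:
            plan.append((block_id, source, destination))
    return plan


# Block map as the name node might hold it: block id -> nodes with a replica.
block_map = {
    "b1": {"dn1", "dn2", "dn3"},
    "b2": {"dn1", "dn2", "dn4"},
    "b3": {"dn2", "dn3", "dn4"},
}
# dn4 has stopped sending heartbeats, so it is treated as failed.
live_nodes = {"dn1", "dn2", "dn3", "dn5"}
plan = plan_replication(block_map, live_nodes)
```

Here b1 still has three live replicas and is left alone, while b2 and b3 each get one copy instruction. A real placement policy would also weigh rack location and free space, which this sketch ignores.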
Hadoop Architecture - Overview
Hadoop Data Processing Architecture
Key Technologies Required for Big Data
Cloud Infrastructure
Virtualization
Networking
Storage
o In-Memory Data Base (Solid State Memory)
o Tiered Storage Software (Performance Enhancement)
o Deduplication (Cost Reduction)
o Data Protection (Back Up, Archive & Recovery)
Cloud Infrastructure for Big Data
Examples
eMail - Yahoo!,Google
Collaboration - Facebook,Twitter
Bus.Apps - SalesForce, GoogleApps, Intuit

Examples
Amazon EC2
Force.com
Navitaire

Examples
Amazon S3
Nirvanix

Infrastructure HW &
Services
- Servers, Network, Storage
- Management, Reporting
SaaS
PaaS
IaaS
Platform Tools & Services
- Deploy developed platforms
ready for Application SW on
Cloud Aware Infrastructure
Software-as-a-Service
- Servers, Network, Storage
- Management, Reporting

Service
Providers
Examples
Public - BT, Telstra, T-Systems, France Telecom
Private
Hybrid - IBM/Cloudburst

Cloud Services Providers
Public - Multitenancy, On Demand
Private - On Premises, Enterprise
Hybrid - Interoperable P2P
[Diagram: Cloud Computing stack. Resources (Servers, Storage, Networks); Virtualization; Operating Systems; Platform Tools & Services (.Net, Python, EJB, Ruby, PHP ..); SaaS Applications; service layers IaaS, PaaS, SaaS; each App paired with an SLA; Management; Public Cloud Service Providers, Private Cloud Enterprise, Hybrid Cloud]
Cloud Infrastructure for Big Data
Applications SLA dictates the Resources Required to meet specific
requirements of Availability, Performance, Cost, Security, Manageability etc.
Private Cloud Requirements for Big Data
[Diagram: Storage Costs vs. Operational Flexibility for App Silos, VZ, Private Cloud Storage and Public Cloud]
Automation
Automated Provisioning - Moving Data & Processes Seamlessly
Allocation, Self Tuning of Resources to meet Workload Requirements
Self-Service
Access Resources on demand to speed deployment and delivery
Scale Resources Up/Down to optimize their usage, Release when not needed
Service Catalog
Choose pre-defined IT Services/user/dept.
Define SLA to efficiently meet services
Service Analytics
Monitor & Analyze Usage for Charge back
Interactively auto-tune performance
Availability with SLA requirements
Virtualization: Workloads Consolidation
Source: Dan Olds & IMEX Research 2009
A single server 1.5x larger than a standard 2-way server will handle the
consolidated load of 6 servers.
VZ manages the workloads, and important apps get the compute
resources they need automatically w/o operator intervention.
Physical consolidation of 15-20:1 is easily possible
Reasonable goal for VZ x86 servers: 40-50% utilization on
large systems (>4-way), rising as dual/quad core processors
become available
Savings result in Real Estate, Power & Cooling, High Availability,
Hardware, Management

Virtualization: TCO Savings
[Chart: Cost over 3 years ($0 to $16,000 scale), w/o VZ vs. w VZ. Cost components: Provisioning, Hardware, SAN, Network, Power & Cooling, DC Real Estate, Disaster Recovery, Downtime, VZ SW & Support]
995 Pre-Virtualization (VZ) Servers; 78 VZ Servers
For TCO Analysis, EM: imex@imexresearch.com (408) 268-0800
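The server counts on the TCO slide (995 servers before virtualization, 78 after) imply a consolidation ratio that can be checked with simple arithmetic. The sketch below is only that arithmetic; the function names are hypothetical, and the 15:1 and 20:1 cases reuse the "15-20:1" range quoted on the consolidation slide.

```python
import math


def consolidation_ratio(physical_servers, virtualized_hosts):
    """Physical servers replaced per virtualized host."""
    return physical_servers / virtualized_hosts


def hosts_needed(physical_servers, ratio):
    """Virtualized hosts required for a given consolidation ratio (rounded up)."""
    return math.ceil(physical_servers / ratio)


ratio = consolidation_ratio(995, 78)   # roughly 12.8 servers per host
hosts_at_15 = hosts_needed(995, 15)    # hosts required at 15:1
hosts_at_20 = hosts_needed(995, 20)    # hosts required at 20:1
```

On these numbers the TCO example sits a little below the 15-20:1 range, at about 12.8:1.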
Storage Infrastructure for Big Data
Storage
Efficiency
Virtualization
Mapping P > V, VM Management
Performance
In-Memory DB, Auto-Tiering-SSD/HDD
Costs Reduction
Thin Provisioning
Deduplication
Availability
RAID/Auto recover HA, Snapshots, CDP, Cloning, DRS
Security
Encryption/DLP
Service
Efficiency
Storage-as-a-Service
Service Catalogs by Workloads etc.
Policy Infrastructure
Service Level Attributes
Service Measurements
Performance Analytics
IOPS/Response Time, Bandwidth
Automation
Unified SAN/NAS Protocols
Auto learning Workload Forensics
Provisioning to Match Workloads
Assured Auto recovery
Storage Architecture - Impact from VZ
Data Protection: Replication; RAID 0,1,5,6,10; Virtual Tape; Back Up/Archive/DR
Storage Efficiency: Virtualization; MAID; Deduplication; Thin Provisioning
Auto Tiering

Virtualization (VZ)
requires Shared Storage for
- VMotion
- Storage VMotion
- HA/DRS
- Fault Tolerance

Additional Capacity
Consumed for
- VZ snapshots,
- VM Kernel etc.

Storage Issues & Solutions
[Chart: Traditional Data Growth vs. Capacity Requirements, showing Storage Costs Reduction]
Technologies Reducing Storage Costs (Capacity Savings ~xx%)
Snapshots ~75%
Thin Provisioning ~30%
DeDuplication ~25-95%
Auto-Tiering 65-95%
Thin Replication ~95%
RAID*DP ~40% vs. R10
Virtual Clones ~80%
Storage Architecture Impacting Big Data
Auto-Tiering System
using Flash SSDs
Data Class
(Tiers 0,1,2,3)
Storage Media Type
(Flash/Disk/Tape)
Policy Engines
(Workload Mgmt.)
Transparent Migration
(Data Placement)
File Virtualization
(Uninterrupted App.Opns.in Migration)
Data Protection: Replication; RAID 0,1,5,6,10; Virtual Tape; Back Up/Archive/DR
Storage Efficiency: Storage Virtualization; MAID; Deduplication; Thin Provisioning
Auto-Tiering
Storage tiers: DRAM, Flash SSD, Performance Disk, Capacity Disk, Tape
Data Storage: Hierarchical Usage
[Chart: I/O Access Frequency vs. Percent of Corporate Data. X axis: % of Corporate Data (1%, 2%, 10%, 50%, 100%); Y axis: % of I/O Accesses (65%, 75%, 95%)]
SSD: Logs, Journals, Temp Tables, Hot Tables
FCoE/SAS Arrays: Tables, Indices, Hot Data
Cloud Storage SATA: Back Up Data, Archived Data, Offsite DataVault
Source: IMEX Research - Cloud Infrastructure Report 2009-12
SSD Storage: Filling Price/Perf.Gaps
[Chart: Price ($/GB) vs. Performance (I/O Access Latency) for CPU, SDRAM, DRAM, SCM, NOR, NAND, PCIe SSD, SATA SSD, HDD, Tape]
HDD becoming cheaper, not faster
DRAM getting faster (to feed faster CPUs) & larger (to feed multi-cores & multi-VMs from virtualization)
SSD segmenting into PCIe SSD cache (as back end to DRAM) & SATA SSD (as front end to HDD)
SSD Storage - Performance & TCO
SAN TCO using HDD vs. Hybrid Storage (Cost $K)
Cost component | HDD Only | HDD/SSD
Power & Cooling | 14.2 | 5.2
RackSpace | 75 | 28
SSDs | 0 | 64
HDD SATA | 0 | 36
HDD FC | 145 | 0
SAN Performance Improvements using SSD
[Chart: Performance (IOPS, 0-300 scale) and $/IOP (0-10 scale) for FC-HDD Only vs. SSD/SATA-HDD]
$/IOPS Improvement: 800%
IOPS Improvement: 475%
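The SAN TCO comparison can be totaled to see the overall saving. The sketch below assumes the chart's data labels pair up in legend order (Power & Cooling, RackSpace, SSDs, HDD SATA, HDD FC) for the HDD Only and HDD/SSD bars; that pairing is a reading of the chart, not something the slide states, and the variable names are hypothetical.

```python
# Cost in $K per component; the pairing of values to bars is an assumption.
hdd_only = {"power_cooling": 14.2, "rack_space": 75, "ssd": 0, "hdd_sata": 0, "hdd_fc": 145}
hybrid = {"power_cooling": 5.2, "rack_space": 28, "ssd": 64, "hdd_sata": 36, "hdd_fc": 0}

total_hdd_only = sum(hdd_only.values())   # total cost of the all-HDD SAN
total_hybrid = sum(hybrid.values())       # total cost of the hybrid HDD/SSD SAN

# Relative saving of the hybrid configuration over the all-HDD one.
savings_pct = 100 * (total_hdd_only - total_hybrid) / total_hdd_only
```

Under that reading, the totals come to about $234K and $133K, a saving of roughly 43%, with the FC disk cost replaced by a smaller combined SSD and SATA cost.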
Workloads Characterization
[Chart: workloads plotted by IOPS* (10 to 1000K) vs. bandwidth in MB/sec (1 to 500)]
*IOPS for a required response time (ms): IOPS = #Channels x Latency^-1
Workloads shown: OLTP, eCommerce, Transaction Processing; Data Warehousing, OLAP, Business Intelligence; Web 2.0, Audio, Video; Scientific Computing, Imaging, HPC; TP, HPC
RAID levels shown: RAID 0, 3; RAID 1, 5, 6
Source: IMEX Research - Cloud Infrastructure Report 2009-12
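The chart footnote relates IOPS to the number of channels and the inverse of latency. A minimal sketch of that relation follows, assuming latency is given in milliseconds (as the "response time (ms)" note suggests) and that "channels" means independent I/O paths serviced in parallel; the function name iops is hypothetical.

```python
def iops(channels, latency_ms):
    """IOPS = #Channels x Latency^-1, with latency converted from ms to seconds."""
    return channels * (1000.0 / latency_ms)


hdd_like = iops(channels=8, latency_ms=5)      # eight channels at 5 ms each
flash_like = iops(channels=8, latency_ms=0.1)  # same channels at 0.1 ms each
```

With the channel count held fixed, cutting latency from 5 ms to 0.1 ms raises the achievable IOPS fifty-fold, which is why latency dominates for the OLTP end of the chart.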
Storage performance, management and costs
are big issues in running Databases

Data Warehousing Workloads are I/O intensive
Predominantly read based with low hit ratios on buffer pools
High concurrent sequential and random read levels
Sequential reads require a high level of I/O bandwidth (MB/sec)
Random reads require high IOPS
Write rates driven by life cycle management and sort operations
OLTP Workloads are strongly random I/O intensive
Random I/O is more dominant
Read/write ratios of 80/20 are most common but can be 50/50
Can be difficult to build out test systems with sufficient I/O characteristics
Batch Workloads are more write intensive
Sequential writes require a high level of I/O bandwidth (MB/sec)
Backup & Recovery times are critical for these workloads
Backup operations drive high levels of sequential I/O
Recovery operations drive high levels of random I/O
Best Practices Storage in Big Data Apps
Goals & Implementation
Establish Goals for SLAs (Performance/Cost/Availability), BC/DR (RPO/RTO) &
Compliance
Increase Performance for DB, OLTP and OLAP Apps:
Random I/O > 20x, Sequential I/O Bandwidth > 5x
Remove stale data from production resources to improve performance
Use Partitioning Software to Classify Data
By Frequency of Access (Recent Usage) and
Capacity (by percent of total Data), using general guidelines such as:
Hyperactive (1%), Active (5%), Less Active (20%), Historical (74%)
Implementation
Optimize Tiering by Classifying Hot & Cold Data
Improve query performance by reducing the number of I/Os
Reduce the number of disks needed by 25-50% using advanced compression software achieving
2-4x compression
Match Data Classification vs. Tiered Devices accordingly
Flash, High Perf Disk, Low Cost Capacity Disk, Online Lowest Cost Archival Disk/Tape
Balance Cost vs. Performance of Flash
More Data in Flash > Higher Cache Hit Ratio > Improved Data Performance
Create and Auto-Manage Tiering (Monitoring, Migrations, Placements) without
manual intervention
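The classification guideline above (Hyperactive 1%, Active 5%, Less Active 20%, Historical 74% of capacity, ranked by access frequency) can be sketched as a simple ranking pass. This is an illustrative sketch, not any vendor's partitioning software: the extent names, sizes and access counts are made up, and the function name classify is hypothetical.

```python
# Capacity share per class, taken from the guideline on the slide.
TIERS = [("Hyperactive", 0.01), ("Active", 0.05), ("Less Active", 0.20), ("Historical", 0.74)]


def classify(extents):
    """Assign each (name, size_gb, accesses_per_day) extent to a class.

    Extents are ranked hottest first; each class is filled until its
    cumulative share of total capacity is used up.
    """
    total = sum(size for _, size, _ in extents)
    ranked = sorted(extents, key=lambda extent: extent[2], reverse=True)
    placement = {}
    cumulative = 0.0
    for name, size, _ in ranked:
        cumulative += size
        share = cumulative / total
        limit = 0.0
        placement[name] = TIERS[-1][0]  # default to the coldest class
        for tier, fraction in TIERS:
            limit += fraction
            if share <= limit + 1e-9:  # small tolerance for float rounding
                placement[name] = tier
                break
    return placement


# Hypothetical extents: (name, size in GB, accesses per day).
extents = [
    ("archive", 74, 3),
    ("indices", 20, 500),
    ("logs", 1, 9000),
    ("hot_tables", 5, 4000),
]
placement = classify(extents)
```

Each class could then be matched to a device tier (for example Flash, High Perf Disk, Low Cost Capacity Disk, Archival Disk/Tape); that one-to-one mapping is an assumption here, since the slide lists the classes and the devices separately.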
Best Practices: I/O Forensics in Storage-Tiering
Source: IBM & IMEX Research SSD Industry Report 2011
LBA Monitoring and Tiered Placement
Every workload has a unique I/O access signature
Historical performance data for a LUN can identify performance skews & hot data regions by LBAs
Storage-Tiering at LBA/Sub-LUN Level
Storage-Tiered Virtualization
[Diagram: Physical Storage (SSD Arrays, HDD Arrays) mapped to a Logical Volume (Hot Data, Cold Data) with Automatic Migration (Policy Based)]
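Finding hot data regions by LBA, as described above, amounts to counting accesses per region of the logical address space. The sketch below is a minimal illustration of that idea: the trace values, the region size of 1024 blocks and the function name hot_regions are all assumptions for the example.

```python
from collections import Counter


def hot_regions(io_trace, region_size, max_regions):
    """Return the most frequently accessed LBA regions, hottest first.

    io_trace is a sequence of logical block addresses; each address is
    bucketed into a fixed-size region and the regions are ranked by hits.
    """
    counts = Counter(lba // region_size for lba in io_trace)
    return [region for region, _ in counts.most_common(max_regions)]


# Hypothetical trace of accessed LBAs for one LUN.
trace = [10, 12, 15, 2050, 11, 13, 9000, 14, 2051, 10]
hottest = hot_regions(trace, region_size=1024, max_regions=1)
top_two = hot_regions(trace, region_size=1024, max_regions=2)
```

In this trace most accesses fall in region 0, so that region would be the first candidate for migration to the SSD tier, while rarely touched regions stay on HDD.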
Best Practices: Cached Storage
LSI MegaRAID CacheCade Pro 2.0 (App/DB Server)
Application | Improvement over Cached vs. HDD only
Oracle OLTP Benchmarks | 681%
SQL Server OLTP Benchmark | 1251%
Neoload (Web Server Simulation) | 533%
SysBench (MySQL OLTP Server) | 150%
[Chart: TPS (0-700 scale) for All HDD, Smart Flash Cache, Persist Data on Warpdrive; data labels 0, 373, 655]
[Chart: Response Time (0-700 scale) for All HDD, Smart Flash Cache, Persist Data on Warpdrive; data labels 0, 330, 660]
Big Data Targets: Analytics
Key Industries Benefitting from Big Data Analytics
[Chart: Global IT Spending by Industry Verticals 2010-15 $B; CAGR % (2010-15) vs. 5 Year Cum Global IT Spending 2010-15 ($B)]
Big Data Targets Storage Infrastructure
Data Intensity by Industry Vertical
[Chart: Installed Terabytes/$M Rev 2010 by industry vertical, axis 0.0 to 0.9]
Banking & Financial Services
Media & Entertainment
Healthcare Providers
Professional Services
Telecommunications
Pharma, Life Sciences & Medical Pdcts
Retail & Wholesales
Utilities
Ind'l Elex & Electrical Equipment
SW Publishing & Internet Srvcs
Consumer Products
Insurance
Transportation
Energy
Value Potential of Using Big Data by Data Intensive Verticals
Big Data Targets: Storage Infrastructure
Big Data Storage Potential: Data Stored by Large US Enterprises
Industry | Stored Data by Industry (in US 2009 PB), % of total | Stored Data TB/Firm (>1K Employees US)
Discrete Manufacturing | 14% | 967
Government | 12% | 1,312
Communications and Media | 10% | 1,792
Process Manufacturing | 10% | 831
Banking | 9% | 1,931
Health Care Providers | 6% | 370
Securities & Investment Srvcs | 6% | 3,866
Professional Services | 6% | 278
Retail | 5% | 697
Education | 4% | 319
Insurance | 4% | 870
Transportation | 3% | 801
Wholesale | 3% | 536
Utilities | 3% | 1,507
Resource Industries | 2% | 825
Consumer and Recreational Services | 2% | 150
Construction | 1% | 231
Big Data Targets: Savings w Open Source
Legacy BI vs. Open Source Big Data Analytics
Rise of Big Data Adoption
Key Takeaways
Big Data is creating a paradigm shift in the IT industry
Leverage the opportunity to optimize your computing infrastructure with Big Data
Infrastructure after due diligence in the selection of vendors/products, industry testing
and interoperability.
Apply the best storage technologies listed in this presentation and elsewhere
Optimize Big Data Analytics for Query Response Time vs. # of Users
Improve query response time for a given number of users (IOPS), or serve more users
(IOPS) for a given query response time
Select Automated Storage Management Software
Data Forensics and Tiered Placement
Every workload has a unique I/O access signature
Historical performance data for a LUN can identify performance skews & hot data regions by
LBAs. Non-disruptively migrate hot data using auto-tiering software
Optimize Infrastructure to meet needs of Applications/SLA
Performance Economics/Benefits
Typically 4-8% of data becomes a candidate for migration and, when migrated to a higher-performance
tier, can provide a response time reduction of ~65% at peak loads. Many industry verticals
and applications will benefit from using Big Data
Q&A / Feedback

Many thanks to the following individuals
for their contributions to this tutorial.
Source: IMEX Research

Joseph White
Anil Vasudeva

Send any questions or comments on this presentation to SNIA: tracktutorials@snia.org
