An Executable Big Data Storage Strategy
Lehigh University
James Young
Planning
Roy Gruver
Infrastructure
Chulin Meng
Library
Steve Anthony
Research Computing
Planning and strategy
We will present the outcome of Lehigh
University's strategic storage planning effort
which motivated disparate members of our
merged technology and library organization to
devise and execute a plan to respond to the
exponential growth in campus demand for
data storage.
Anchoring strategy
High performance
Virtualization
Digital Library
Open source
Research Computing
Digital Library
Cloud services
End-user Desktop
Anchoring strategy
Crossing boundaries
Collaboration
Partnerships
Understand problem
Unabated need
Fragmented
Anticipate needs
Access
Sync
Outcomes
Rapid increase in performance & capacity
Decreased per-TB cost
SAN culture
Open source Lehigh cloud
Strategic investments in external cloud services
Improved service
New partnerships
Overview
A problem that needs strategy
why storage is important
The campus landscape
forging an understanding
The case for change
three vantages
strategy is indispensable
a realistic set of executable ideas
John Kotter's 8 Steps
for Leading Change
Setting a goal
Increase storage resources
four-fold to one petabyte
by summer 2015.
storage is a campus problem
more, faster, better, cheaper, now
...disparate audiences
and can you keep it longer
and put it in the cloud?
Gartner research vice president and data management thought leader Douglas
Laney characterizes big data as
"high volume, high velocity, and/or high
variety information assets that require new
forms of processing to enable enhanced
decision making, insight discovery, and
process optimization"
(Laney, 2012).
Big data
Moreover
The key requirements of big data storage are that it can
handle very large amounts of data and
keep scaling to keep up with growth,
and that it can provide the input/output operations per
second (IOPS) necessary to deliver data to analytics tools.
Storage drivers
About Lehigh
Residential research university
Bethlehem, Pennsylvania
6900 students
$425 million budget


Libraries
Digital content lifecycle
Archives
Storage services
Teaching & Learning
Research Computing
ScienceDMZ with DTN
Collaboration portals
Infrastructure
Centralized storage & back-ups
Individual & department space
Disaster Recovery
Enterprise, Web & Mobile
Applications + data
Desktop
Storage snapshot - historical
Storage snapshot - management
Users
Unaware
Confused
Bypass LTS
Service
Fragmented
Unaligned responses
Answer is no
Capacity
Under-capacity
Unsure of future need
Growth unbound
Technology
Primarily DAS
Some NAS/SAN
Exploratory Cloud
Cost
Hard to determine
Lifecycle lacking
No reliable funding
Building a strategy
Principles:
Understanding needs to keep pace
Crossing boundaries & pooling resources
Each change is an opportunity
Reducing aversion to risk
Change - how we look at clients
Separate historical client base
No longer as different as we thought
Same client many needs
Traditional client relationships
Change - how we look at budgets
Historical budget relationships
Conscious agreement and commitment
Moved up the organization chart
Share storage platforms
Change - how we look at risk
Data Center model - low risk threshold
Reliability, performance, up-time, DR
Developed multiple models
access; performance; backup; protection
Distributed storage framework
matching storage platforms to specific data types
Change - external factors
Growth in VM environment
common store
right-sized storage allocations
internal charges for VMs
Mature products in the marketplace
many to choose from
extended distributed storage framework
cost-effective
Year | Solution | Used For | Budget | Lifecycle
2012-2013 | EMC SAN | Virtual Applications | Server | None
2012-2013 | DuraSpace | Library (test) | One-time | N/A
2013-2014 | EMC SAN | Library | One-time | None
2013-2014 | Pitt Supercomputing Center | Library (test) | One-time | N/A
2014-2015 | DropBox (for Business) | Desktop & Sharing | Resell on Campus | N/A
2014-2015 | Campus Cloud (Ceph) | Library & Research | One-time | Resell on Campus
2014-2015 | Crashplan/Ceph | Desktop Backup | Client license | Resell on Campus
2014-2015 | OneDrive | Whatever | MS 360 License |
2014-2015 | Google Drive | Whatever | Google Apps for Education |
2014-2015 | Data Transfer Node | Collaborative Research | NSF |
2014-2015 | SAN Expansion | Email & Applications | Reallocation + 1 time |
Research Computing
Increasing number of small (20-50TB) RAID deployments. Plan to introduce more, Fall 2013.
Interest in SOHO NAS devices (Synology, Drobo, etc.). Best Buy syndrome.
Lack of interest in paying for $500/TB/yr offering.
Primary 32TB NAS device EOL.
Primary cluster storage RAID1 completely overwhelmed.
http://upload.wikimedia.org/wikipedia/commons/a/ac/Iceberg.jpg
Mechanical Engineering
Purchased 1TB in EOL NAS.
Thinking of 36TB RAID 6 server purchase, located in datacenter.
Backup solution is multiple 1TB drives in lab, manual copy process.
Desire for integration into central computer cluster.
Well established relationship with Research Computing team.
http://sourceforge.net/p/lammps/mailman/attachment/4B3D2B50.1030907@gmail.com/3/
we all need more storage!
Why Ceph?
Mature open source project since 2004.
Unified Software Defined Storage platform.
Run on heterogeneous commodity hardware.
Scale performance with capacity.
Secure data on-site, full encryption an option.
Community and RedHat Enterprise support.
Ceph Topology
Lehigh Ceph Deployment
OSD Servers (35)
Intel E5-2630 2.4GHz, 4 core (2)
32GB RAM (2x16GB)
Intel X540-T2 10Gb NIC
WD Red 4TB hard drive (13)
KC300 60GB SSD (5)
LSI 2280 SAS controller, JBOD
Monitors (3)
VMs running on Ganeti cluster
2GB RAM
10GB disk
1Gb connection
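The OSD server counts above imply a raw capacity that can be checked with simple arithmetic. The sketch below does that in Python. The server, drive, and drive-size figures come from the slide; the replication factor of 3 is an assumption (the slides do not state how the pools are configured), and the result ignores filesystem overhead and free-space headroom.

```python
# Back-of-the-envelope capacity estimate for the OSD servers listed above.
OSD_SERVERS = 35        # from the slide
DRIVES_PER_SERVER = 13  # WD Red 4TB drives per server, from the slide
DRIVE_TB = 4            # vendor (decimal) terabytes per drive
REPLICAS = 3            # ASSUMPTION: replicated pools keeping three copies


def raw_capacity_tb(servers: int, drives: int, drive_tb: float) -> float:
    """Total raw disk capacity across all OSD servers."""
    return servers * drives * drive_tb


def usable_capacity_tb(raw_tb: float, replicas: int) -> float:
    """Capacity left when every object is stored `replicas` times."""
    return raw_tb / replicas


raw = raw_capacity_tb(OSD_SERVERS, DRIVES_PER_SERVER, DRIVE_TB)
usable = usable_capacity_tb(raw, REPLICAS)
print(f"raw: {raw} TB, usable at {REPLICAS}x replication: {usable:.0f} TB")
```

Under these assumptions the raw figure is 1,820 TB. A different replication factor or an erasure-coded pool would change the usable figure, so treat it as an upper-level estimate only.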
A simple cost model
Mechanical Engineering
"I think the best way to convey what the Ceph storage system has done for my
research team is through an example. In one of our projects, we're looking for
mechanisms, which - simply put - means we don't know what we're looking for.
We have to keep a lot of data and sort through it, typically many times over,
testing and re-testing what we believe to be relevant events in the phenomena
we're studying.
Without the ability to store a large footprint of data and access that data in a
relatively rapid fashion, we would not be able to pursue this line of our
research. In that sense, it is easy to see that Ceph has been a highly enabling
element in my group's research." --Edmund B. Webb III, Associate Professor
Lehigh Ceph Deployment
Digital Library Storage Solution
A Case Study of Implementing the LTS
Storage Strategy
Digital Library: Storage Needs
Digital Special Collections
Digital Repository
Data Management and Preservation
Digital Scholarship
Digital Asset Management
Digital Library: Storage Challenges
Ever Growing Needs vs Limited Resources
15TB Storage Space (DAS+NAS), Running out of Capacity
Need 10TB Storage Immediately
Expect to Reach 50TB of Data in Three Years
$200K One-Time Fund in FY14
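The growth these figures imply can be made explicit. The sketch below assumes the existing 15TB is essentially full (the slide says capacity is running out) and that growth compounds annually toward the expected 50TB in three years. The function names are illustrative only.

```python
def annual_growth_rate(start_tb: float, end_tb: float, years: int) -> float:
    """Compound annual growth rate that takes start_tb to end_tb in `years`."""
    return (end_tb / start_tb) ** (1 / years) - 1


def project(start_tb: float, rate: float, years: int) -> list[float]:
    """Projected data volume at the end of each year, starting with year 0."""
    return [start_tb * (1 + rate) ** year for year in range(years + 1)]


# ASSUMPTION: today's data volume is roughly the 15TB of existing space.
rate = annual_growth_rate(start_tb=15, end_tb=50, years=3)
projection = project(start_tb=15, rate=rate, years=3)

print(f"implied annual growth: {rate:.1%}")
for year, tb in enumerate(projection):
    print(f"year {year}: {tb:.1f} TB")
```

Under these assumptions the implied growth is a little under 50% per year, which is why a one-time 10TB addition would only postpone the problem.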
Digital Library: Storage Strategies
Collaborate, Leverage, Re-envision
Collaborate with LTS units to tackle storage challenges
Leverage existing EMC cluster to address immediate
digital library and digital storage needs
Re-envision the storage system to create a sustainable
scale-out architecture (focusing on emerging Ceph
technology)
Digital Library Storage: Outcome
55TB on EMC SAN to address immediate storage needs
for Digital Library and Digital Scholarship projects.
450TB on the Ceph Storage Cluster will be in the shared LTS
storage pool. We plan to charge users an affordable price
to recoup the cost and build a life-cycle fund.
With the Ceph storage platform, we can create storage
space on demand at an affordable cost. We no longer have
to answer "no" because of storage space limitations.
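The idea of charging users enough to recoup the cost and build a life-cycle fund can be sketched as a break-even calculation. This is not Lehigh's actual cost model. The $200K and 450TB inputs are borrowed from the slides purely as illustrative values (the slides do not say the fund paid for exactly this capacity), and the five-year lifecycle and utilization figures are assumptions.

```python
def breakeven_rate(hardware_cost: float, usable_tb: float,
                   lifecycle_years: int, utilization: float = 1.0) -> float:
    """Price per TB per year that recovers hardware_cost over the lifecycle.

    `utilization` is the fraction of usable capacity that is actually sold.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hardware_cost / (usable_tb * utilization * lifecycle_years)


# ILLUSTRATIVE INPUTS ONLY: see the assumptions stated above.
full = breakeven_rate(hardware_cost=200_000, usable_tb=450, lifecycle_years=5)
half = breakeven_rate(hardware_cost=200_000, usable_tb=450, lifecycle_years=5,
                      utilization=0.5)

print(f"break-even at full utilization: ${full:.2f}/TB/yr")
print(f"break-even at 50% utilization:  ${half:.2f}/TB/yr")
```

A real model would also include staff time, power, networking, and replacement drives, so the break-even price here is a floor rather than a quote.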
John Kotter's 8 Steps
for Leading Change
Strategic shifts
Before | Variable | Outcome
Distributed, DAS Centric | Storage | Open source internal cloud
Exploratory | Cloud | Tiered adoption
Cobble, scratch | Resources | Grants + Lifecycle + one-time
Hard to track, no lifecycle | Budget | Substantially reduced TCO
Fragmented | Planning | Participatory
Off-radar | Culture | Vehicle for change
Thank you!
What this means for you
Being sustainable
Connect the dots
Reduce expenses
Develop strategy