Moving From Hadoop to Spark

Sujee Maniyam
Founder / Principal @ www.ElephantScale.com
sujee@elephantscale.com
Bay Area ACM meetup (2015-02-23)


Featured in Hadoop Weekly #109

About Me : Sujee Maniyam

- 15+ years of software development experience
- Consulting & training in Big Data
- Author
  - "Hadoop Illuminated" (open source book)
  - "HBase Design Patterns" (coming soon)
- Open source contributor (including HBase)
  - http://github.com/sujee
- Founder / organizer of the Big Data Guru meetup
  - http://www.meetup.com/BigDataGurus/
- http://sujee.net/
- Contact : sujee@elephantscale.com

Hadoop in 20 Seconds

- The Big Data platform
- Very well field-tested
- Scales to petabytes of data
- MapReduce : batch-oriented compute


Hadoop Eco System

[Diagram : the Hadoop ecosystem, spanning batch and real-time components]

Hadoop Ecosystem

- HDFS : provides distributed storage
- MapReduce : provides distributed computing
- Hive : SQL layer over Hadoop
- Pig : high-level MapReduce
- HBase : NoSQL storage for real-time queries

Spark in 20 Seconds

- Fast & expressive cluster computing engine
- Compatible with Hadoop
- Came out of the Berkeley AMP Lab
- Now an Apache project
- Version 1.2 just released (Dec 2014)
- "First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" - stratio.com

Spark Eco-System

[Diagram : Spark Core, with Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning) and GraphX (graph processing) on top; runs standalone or on YARN / Mesos cluster managers]

Hype-o-meter :-)

Spark Job Trends

[Chart : Spark job posting trends]

Spark Benchmarks

[Chart]
Source : stratio.com

Spark Code / Activity

[Chart]
Source : stratio.com

Timeline : Hadoop & Spark

[Timeline]

Hadoop Vs. Spark

[Image]
Source : http://www.kwigger.com/mit-skifte-til-mac/

Comparison With Hadoop

| Hadoop                                    | Spark                                                                      |
|-------------------------------------------|----------------------------------------------------------------------------|
| Distributed storage + distributed compute | Distributed compute only                                                   |
| MapReduce framework                       | Generalized computation                                                    |
| Usually data on disk (HDFS)               | On disk / in memory                                                        |
| Not ideal for iterative work              | Great at iterative workloads (machine learning etc.)                       |
| Batch processing                          | Up to 2x-10x faster for data on disk, up to 100x faster for data in memory |
|                                           | Compact code; Java, Python, Scala supported                                |
|                                           | Shell for ad-hoc exploration                                               |

Hadoop + YARN : Universal OS for Distributed Compute

[Diagram : layered stack]
- Applications : batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
- YARN : cluster management
- HDFS : storage

Spark Is a Better Fit for Iterative Workloads

[Diagram]

Spark Programming Model

- More generic than MapReduce (a minimal sketch follows)
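To illustrate that generality, here is a hedged sketch (input path and threshold are hypothetical) of a word count followed by a filter and a sort - a pipeline that would take several chained MapReduce jobs to express:

// Sketch : a multi-stage pipeline in a few lines of Scala
val frequentWords = sc.textFile("hdfs:///data/docs/")
  .flatMap(line => line.split("\\s+"))        // tokenize
  .map(word => (word, 1))
  .reduceByKey(_ + _)                         // the classic "MapReduce" part
  .filter { case (_, count) => count > 100 }  // keep frequent words only
  .sortBy { case (_, count) => -count }       // order by descending frequency

frequentWords.take(10).foreach(println)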

Is Spark Replacing Hadoop?

- Spark runs on Hadoop / YARN
- Complementary
- The Spark programming model is more flexible than MapReduce
- Spark is really great if data fits in memory (few hundred gigs)
- Spark is storage agnostic (see next slide)

Spark & Pluggable Storage

[Diagram : Spark (compute engine) on top of interchangeable storage : HDFS, Amazon S3, Cassandra, ???]
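In practice, switching storage back ends is largely a matter of the URI scheme passed to the input API. A hedged sketch (all paths are hypothetical; Cassandra would additionally need the separate spark-cassandra-connector package):

// Same compute code, different storage back ends
val fromHdfs = sc.textFile("hdfs:///data/events/")        // HDFS
val fromS3   = sc.textFile("s3n://my-bucket/events/")     // Amazon S3
val fromNfs  = sc.textFile("file:///mnt/shared/events/")  // NFS mount visible on every worker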

Spark & Hadoop

| Use Case                                 | Other                                | Spark                                                        |
|------------------------------------------|--------------------------------------|--------------------------------------------------------------|
| Batch processing                         | Hadoop's MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)                           |
| SQL querying                             | Hadoop : Hive                        | Spark SQL                                                    |
| Stream processing / real-time processing | Storm, Kafka                         | Spark Streaming                                              |
| Machine learning                         | Mahout                               | Spark MLlib                                                  |
| Real-time lookups                        | NoSQL (HBase, Cassandra etc.)        | No Spark component, but Spark can query data in NoSQL stores |

Hadoop & Spark Future ???

Why Move From Hadoop to Spark?

- Spark is easier than Hadoop
  - Friendlier for data scientists / analysts
- Interactive shell
  - Fast development cycles
  - Ad-hoc exploration (see the sample session below)
- API supports multiple languages
  - Java, Scala, Python
- Great for small (gigs) to medium (100s of gigs) data
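A hedged sample session (data path and contents are hypothetical) showing what that ad-hoc exploration feels like:

$ ./bin/spark-shell
scala> val logs = sc.textFile("hdfs:///data/web-logs/")
scala> logs.filter(_.contains("ERROR")).count()
scala> logs.filter(_.contains("ERROR")).map(_.split(" ")(0)).countByValue()

Each step gives immediate feedback - no compile / package / submit cycle as with a MapReduce job.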

Spark : Unified Stack

- Spark supports multiple programming models
  - MapReduce-style batch processing
  - Streaming / real-time processing
  - Querying via SQL
  - Machine learning
- All modules are tightly integrated
  - Facilitates rich applications
- Spark can be the only stack you need !
  - No need to run multiple clusters (Hadoop cluster, Storm cluster etc.)

Migrating From Hadoop to Spark

| Functionality       | Hadoop | Spark                                       |
|---------------------|--------|---------------------------------------------|
| Distributed storage | HDFS   | Cloud storage like Amazon S3, or NFS mounts |
| SQL querying        | Hive   | Spark SQL                                   |
| ETL workflow        | Pig    | Spork (Pig on Spark); mix of Spark SQL etc. |
| Machine learning    | Mahout | MLlib                                       |
| NoSQL DB            | HBase  | ???                                         |

Moving From Hadoop to Spark

1. Data size
2. File system
3. SQL
4. ETL
5. Machine learning

Hadoop To Spark

[Diagram : where Spark can help, across batch, real-time and Big Data workloads]

Data Size : You Don't Have Big Data

1) Data Size (T-shirt sizing)

[Scale : < few G | 10 G+ | 100 G+ | 1 TB+ | 100 TB+ | PB+ ; Spark fits the small end, Hadoop the large end]
Image credit : blog.trumpi.co.za

1) Data Size

- Lots of Spark adoption at SMALL-MEDIUM scale
  - Good fit : data might fit in memory !!
  - Hadoop may be overkill
- Applications
  - Iterative workloads (machine learning etc.)
  - Streaming
- Hadoop is still the preferred platform for TB+ data

Next : 2) File System


2) File System

- Hadoop = storage + compute
- Spark = compute only
  - Spark needs a distributed FS
- File system choices for Spark
  - HDFS : Hadoop Distributed File System
    - Reliable
    - Good performance (data locality)
    - Field-tested for PBs of data
  - S3 : Amazon
    - Reliable cloud storage
    - Huge scale
  - NFS : Network File System (shared FS across machines)

Spark File Systems

[Diagram]

File Systems For Spark

|               | HDFS                   | NFS           | Amazon S3        |
|---------------|------------------------|---------------|------------------|
| Data locality | High (best)            | Local enough  | None (ok)        |
| Throughput    | High (best)            | Medium (good) | Low (ok)         |
| Latency       | Low (best)             | Low           | High             |
| Reliability   | Very high (replicated) | Low           | Very high        |
| Cost          | Varies                 | Varies        | $30 / TB / month |

File System Throughput Comparison (HDFS Vs. S3)

- Data : 10 G+ (11.3 G)
  - Each file : ~1+ G (x 10)
  - 400 million records total
  - Partition size : 128 M
  - Same data on HDFS & S3
- Cluster :
  - 8 nodes on Amazon m3.xlarge (4 CPU, 15 G mem, 40 G SSD)
  - Hadoop cluster : latest Hortonworks HDP v2.2
  - Spark : on the same 8 nodes, standalone, v1.2

File System Throughput Comparison (HDFS Vs. S3)

// load the same data set from both file systems (paths elided)
val hdfs = sc.textFile("hdfs:///____/10G/")
val s3 = sc.textFile("s3n://______/10G/")

// count # records - forces a full scan, measuring raw read throughput
hdfs.count()
s3.count()
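To put numbers on the comparison, each count can be wrapped in a simple timer. A minimal sketch (the timed helper is ours, not part of the Spark API):

// Hypothetical helper : run a block and report elapsed seconds
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label : ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

timed("HDFS count") { hdfs.count() }
timed("S3 count")   { s3.count() }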

HDFS Vs. S3

[Chart : throughput comparison]

HDFS Vs. S3 (lower is better)

[Chart : elapsed-time comparison]

HDFS Vs. S3 Conclusions

| HDFS                                    | S3                                                                      |
|-----------------------------------------|-------------------------------------------------------------------------|
| Data locality -> much higher throughput | Data is streamed -> lower throughput                                    |
| Need to maintain a Hadoop cluster       | No Hadoop cluster to maintain -> convenient                             |
| Large data sets (TB+)                   | Good use case : smallish data sets (few gigs), load once, cache, re-use |

Next : 3) SQL


3) SQL in Hadoop / Spark

|                  | Hadoop             | Spark                                            |
|------------------|--------------------|--------------------------------------------------|
| Engine           | Hive               | Spark SQL                                        |
| Language         | HiveQL             | HiveQL, RDD programming in Java / Python / Scala |
| Scale            | Petabytes          | Terabytes ?                                      |
| Interoperability |                    | Can read Hive tables or stand-alone data         |
| Formats          | CSV, JSON, Parquet | CSV, JSON, Parquet                               |

SQL In Hadoop / Spark

- Input : billing records / CDRs
  - Schema : Timestamp (milliseconds), Customer_id (string), Resource_id (int), Qty (int), Cost (int)
  - Sample rows : ts 1000, Phone, qty 10, 10c ; ts 1003, SMS, 4c ; ts 1005, Data, 3M, 5c
- Query : find the top-10 customers
- Data set
  - 10 G+ data
  - 400 million records
  - CSV format

SQL In Hadoop / Spark

- Hive table :

CREATE EXTERNAL TABLE billing (
    ts BIGINT,
    customer_id INT,
    resource_id INT,
    qty INT,
    cost INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs location';

- Hive query (simple aggregate) :

select customer_id, SUM(cost) as total from billing
group by customer_id order by total DESC LIMIT 10;

Hive Query Results

[Screenshot]

Spark + Hive Table

- Spark code to access the Hive table :

// requires a Spark build with Hive support
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val top10 = hiveCtx.sql(
  "select customer_id, SUM(cost) as total from billing " +
  "group by customer_id order by total DESC LIMIT 10")
top10.collect()

Spark SQL Vs. Hive

[Chart : fast on the same HDFS data !]

SQL In Hadoop / Spark : Conclusions

- Spark can readily query Hive tables
  - Speed !
  - Great for exploring / trying out
  - Fast iterative development
- Spark can load data natively (see the sketch below)
  - CSV
  - JSON (schema automatically inferred)
  - Parquet (schema automatically inferred)
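A hedged sketch of the native loading path, using the Spark 1.2-era SQLContext API (file paths are hypothetical; CSV had no built-in loader at the time, so it is parsed by hand):

import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// JSON and Parquet : schema is inferred automatically
val events = sqlCtx.jsonFile("hdfs:///data/events.json")
events.registerTempTable("events")
sqlCtx.sql("SELECT COUNT(*) FROM events").collect()

val metrics = sqlCtx.parquetFile("hdfs:///data/metrics.parquet")

// CSV : split and convert fields manually
val billing = sc.textFile("hdfs:///data/billing.csv")
  .map(_.split(","))
  .map(f => (f(0).toLong, f(1), f(2), f(3).toInt, f(4).toInt))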

Next : 4) ETL In Hadoop / Spark

ETL?

[Diagram : Data 1 -> Data 2 (clean) -> Data 3 -> Data 4]

4) ETL on Hadoop / Spark

|           | Hadoop                  | Spark                                        |
|-----------|-------------------------|----------------------------------------------|
| ETL tools | Pig, Cascading, Oozie   | Native RDD programming (Scala, Java, Python) |
| Pig       | High-level ETL workflow | Spork : Pig on Spark                         |
| Cascading | High level              | Spark-scalding                               |

ETL On Hadoop / Spark

- Pig
  - High-level, expressive data flow language (Pig Latin)
  - Easier to program than Java MapReduce
  - Used for ETL (data cleanup / data prep)
  - Spork : run Pig on Spark
    - As simple as $ pig -x spark ...
    - https://github.com/sigmoidanalytics/spork
- Cascading
  - High-level data flow declarations
  - Many sources (Cassandra / Accumulo / Solr)
  - Spark-Scalding
    - https://github.com/tresata/spark-scalding

ETL On Hadoop / Spark : Conclusions

- Try Spork or spark-scalding
  - Code re-use
  - No re-writing from scratch
- Program RDDs directly (see the sketch below)
  - More flexible
  - Multiple language support : Scala / Java / Python
  - Simpler / faster in some cases
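What programming RDDs directly looks like for a typical cleanup job - a hedged sketch (file layout and validity rules are hypothetical):

// Sketch : Pig-style cleanup expressed as RDD transformations
val raw = sc.textFile("hdfs:///data/billing-raw/")

val cleaned = raw
  .map(_.split(","))
  .filter(_.length == 5)                 // drop malformed rows
  .filter(f => f(4).forall(_.isDigit))   // drop rows with a non-numeric cost
  .map(_.mkString(","))

cleaned.saveAsTextFile("hdfs:///data/billing-clean/")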

5) Machine Learning : Hadoop / Spark

|                      | Hadoop | Spark                 |
|----------------------|--------|-----------------------|
| Tool                 | Mahout | MLlib                 |
| API                  | Java   | Java / Scala / Python |
| Iterative algorithms | Slower | Very fast (in memory) |
| In-memory processing | No     | YES                   |

- Efforts to port Mahout onto Spark - lots of momentum ! (An MLlib sketch follows.)
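A hedged sketch of an iterative MLlib workload - k-means clustering (path, k and iteration count are hypothetical); the cache() is what makes the repeated passes fast:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse numeric CSV into feature vectors, kept in memory because
// k-means makes many passes over the same data
val features = sc.textFile("hdfs:///data/features.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = KMeans.train(features, 10, 20)  // k = 10 clusters, 20 iterations
println(model.clusterCenters.mkString("\n"))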

Spark Is a Better Fit for Iterative Workloads

[Diagram]

Spark Caching !

- Reading data from a remote FS (S3) can be slow
- For small / medium data (10s - 100s of GB), use caching (see the sketch below)
  - Pay the read penalty once
  - Cache
  - Then very high-speed computes (in memory)
  - Recommended for iterative workloads
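A minimal sketch of the pattern (bucket name is hypothetical) - the first action pays the S3 read cost, later actions run from memory:

val records = sc.textFile("s3n://my-bucket/10G/").cache()

records.count()   // 1st pass : reads from S3 and populates the cache
records.count()   // 2nd pass : served from memory, much faster
records.filter(_.contains("ERROR")).count()   // further queries also hit the cache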

Caching Demo !

Caching Results

[Chart : uncached vs. cached run times - cached !]

Spark Caching

- Caching is pretty effective (small / medium data sets)
- Cached data cannot be shared across applications
  - (each application executes in its own sandbox)

Sharing Cached Data

- 1) Spark Job Server
  - Multiplexer : all requests are executed through the same context
  - Provides a web-service interface
- 2) Tachyon (see the sketch below)
  - Distributed in-memory file system
  - Memory is the new disk !
  - Out of the AMP Lab, Berkeley
  - Early stages (very promising)
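From Spark, Tachyon mostly looks like one more file system URI. A hedged sketch (host, port and paths are hypothetical; 19998 is merely Tachyon's conventional default, and the tachyon-client jar must be on the classpath):

// Write an RDD through Tachyon so a second Spark application can read it
// at memory speed, without re-reading from S3 or re-computing
val shared = sc.textFile("s3n://my-bucket/10G/")
shared.saveAsTextFile("tachyon://tachyon-master:19998/shared/10G")

// ... in another application :
val reloaded = sc.textFile("tachyon://tachyon-master:19998/shared/10G")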

Spark Job Server

[Architecture diagram]

Spark Job Server

- Open sourced from Ooyala
- "Spark as a Service" : simple REST interface to launch jobs (see the sketch below)
- Sub-second latency !
- Pre-load jars for even faster spin-up
- Share cached RDDs across requests (NamedRDD) - illustrative pseudocode :

  App1 : ctx.saveRDD("my cached rdd", rdd1)
  App2 : RDD rdd2 = ctx.loadRDD("my cached rdd")

- https://github.com/spark-jobserver/spark-jobserver
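Driving it over REST, as a hedged sketch (host, port, jar and class names are hypothetical; the /jars and /jobs routes follow the project's documentation):

# upload an application jar under the name "billing"
curl --data-binary @billing-job.jar localhost:8090/jars/billing

# launch a job synchronously; results come back in the HTTP response
curl -d "" "localhost:8090/jobs?appName=billing&classPath=demo.Top10Job&sync=true"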

Tachyon + Spark

[Architecture diagram]

Next : New Big Data Applications With Spark

Big Data Applications : Now

- Analysis is done in batch mode (minutes / hours)
- Final results are stored in a real-time data store like Cassandra / HBase
- The results are displayed in a dashboard / web UI
- Doing interactive analysis ????
  - Need special BI tools

With Spark

- Load a data set (gigabytes) from S3 and cache it (one time)
- Super fast (sub-second) queries on the data
- Response time : seconds (just like a web app !) - an end-to-end sketch follows
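Putting the pieces together - a hedged end-to-end sketch (bucket and schema are hypothetical):

import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// One-time load : pull the data set from S3, register it, pin it in memory
val billing = sqlCtx.jsonFile("s3n://my-bucket/billing-json/")
billing.registerTempTable("billing")
sqlCtx.cacheTable("billing")

// Subsequent queries run against the in-memory table,
// giving interactive, web-app-like response times
val top10 = sqlCtx.sql(
  "SELECT customer_id, SUM(cost) AS total FROM billing " +
  "GROUP BY customer_id ORDER BY total DESC LIMIT 10").collect()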

Lessons Learned

- Build sophisticated apps !
  - Web response time (a few seconds) !!
  - In-depth analytics
- Leverage existing libraries in Java / Scala / Python
- "Data analytics as a service"

Final Thoughts

- Already on Hadoop?
  - Try Spark side-by-side
  - Process some data in HDFS
  - Try Spark SQL for Hive tables
- Contemplating Hadoop?
  - Try Spark (standalone)
  - Choose an NFS or S3 file system
  - Take advantage of caching
    - Iterative loads
    - Spark Job Server
    - Tachyon
- Build a new class of big / medium data apps

Thanks !
Sujee Maniyam
sujee@elephantscale.com
http://elephantscale.com
Expert consulting & training in Big Data
(Now offering Spark training)
