Moving From Hadoop to Spark

Sujee Maniyam
Founder / Principal @ www.ElephantScale.com
sujee@elephantscale.com
Bay Area ACM meetup (2015-02-23)


Featured in Hadoop Weekly #109

About Me : Sujee Maniyam

- 15+ years of software development experience
- Consulting & training in Big Data
- Author
  - "Hadoop Illuminated" (open source book)
  - "HBase Design Patterns" (coming soon)
- Open source contributor (including HBase)
  - http://github.com/sujee
- Founder / organizer of the Big Data Guru meetup
  - http://www.meetup.com/BigDataGurus/
- http://sujee.net/
- Contact : sujee@elephantscale.com

Hadoop in 20 Seconds

- The Big Data platform
- Very well field-tested
- Scales to petabytes of data
- MapReduce : batch-oriented compute


Hadoop Eco System

[Diagram : the Hadoop ecosystem, spanning batch and real-time components]

Hadoop Ecosystem

- HDFS : provides distributed storage
- MapReduce : provides distributed computing
- Hive : SQL layer over Hadoop
- Pig : high-level MapReduce
- HBase : NoSQL storage for real-time queries

Spark in 20 Seconds

- Fast & expressive cluster computing engine
- Compatible with Hadoop
- Came out of the Berkeley AMP Lab
- Now an Apache project
- Version 1.2 just released (Dec 2014)
- "First Big Data platform to integrate batch, streaming and interactive computations in a unified framework" - stratio.com

Spark Eco-System

[Diagram : Spark Core, with Spark SQL (schema / SQL), Spark Streaming (real time), MLlib (machine learning) and GraphX (graph processing) on top; runs standalone or on YARN / Mesos cluster managers]

Hype-o-meter :-)

Spark Job Trends

[Chart : Spark job posting trends]

Spark Benchmarks

[Chart]
Source : stratio.com

Spark Code / Activity

[Chart]
Source : stratio.com

Timeline : Hadoop & Spark

[Timeline]

Hadoop Vs. Spark

[Image]
Source : http://www.kwigger.com/mit-skifte-til-mac/

Comparison With Hadoop

| Hadoop                                    | Spark                                                                      |
|-------------------------------------------|----------------------------------------------------------------------------|
| Distributed storage + distributed compute | Distributed compute only                                                   |
| MapReduce framework                       | Generalized computation                                                    |
| Usually data on disk (HDFS)               | On disk / in memory                                                        |
| Not ideal for iterative work              | Great at iterative workloads (machine learning etc.)                       |
| Batch processing                          | Up to 2x-10x faster for data on disk, up to 100x faster for data in memory |
|                                           | Compact code; Java, Python, Scala supported                                |
|                                           | Shell for ad-hoc exploration                                               |

Hadoop + YARN : Universal OS for Distributed Compute

[Diagram : layered stack]
- Applications : batch (MapReduce), streaming (Storm, S4), in-memory (Spark)
- YARN : cluster management
- HDFS : storage

Spark Is a Better Fit for Iterative Workloads

[Diagram]

Spark Programming Model

- More generic than MapReduce (a minimal sketch follows)
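To illustrate that generality, here is a hedged sketch (input path and threshold are hypothetical) of a word count followed by a filter and a sort - a pipeline that would take several chained MapReduce jobs to express:

// Sketch : a multi-stage pipeline in a few lines of Scala
val frequentWords = sc.textFile("hdfs:///data/docs/")
  .flatMap(line => line.split("\\s+"))        // tokenize
  .map(word => (word, 1))
  .reduceByKey(_ + _)                         // the classic "MapReduce" part
  .filter { case (_, count) => count > 100 }  // keep frequent words only
  .sortBy { case (_, count) => -count }       // order by descending frequency

frequentWords.take(10).foreach(println)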

Is Spark Replacing Hadoop?

- Spark runs on Hadoop / YARN
- Complementary
- The Spark programming model is more flexible than MapReduce
- Spark is really great if data fits in memory (few hundred gigs)
- Spark is storage agnostic (see next slide)

Spark & Pluggable Storage

[Diagram : Spark (compute engine) on top of interchangeable storage : HDFS, Amazon S3, Cassandra, ???]
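In practice, switching storage back ends is largely a matter of the URI scheme passed to the input API. A hedged sketch (all paths are hypothetical; Cassandra would additionally need the separate spark-cassandra-connector package):

// Same compute code, different storage back ends
val fromHdfs = sc.textFile("hdfs:///data/events/")        // HDFS
val fromS3   = sc.textFile("s3n://my-bucket/events/")     // Amazon S3
val fromNfs  = sc.textFile("file:///mnt/shared/events/")  // NFS mount visible on every worker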

Spark & Hadoop

| Use Case                                 | Other                                | Spark                                                        |
|------------------------------------------|--------------------------------------|--------------------------------------------------------------|
| Batch processing                         | Hadoop's MapReduce (Java, Pig, Hive) | Spark RDDs (Java / Scala / Python)                           |
| SQL querying                             | Hadoop : Hive                        | Spark SQL                                                    |
| Stream processing / real-time processing | Storm, Kafka                         | Spark Streaming                                              |
| Machine learning                         | Mahout                               | Spark MLlib                                                  |
| Real-time lookups                        | NoSQL (HBase, Cassandra etc.)        | No Spark component, but Spark can query data in NoSQL stores |

Hadoop & Spark Future ???

Why Move From Hadoop to Spark?

- Spark is easier than Hadoop
  - Friendlier for data scientists / analysts
- Interactive shell
  - Fast development cycles
  - Ad-hoc exploration (see the sample session below)
- API supports multiple languages
  - Java, Scala, Python
- Great for small (gigs) to medium (100s of gigs) data
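A hedged sample session (data path and contents are hypothetical) showing what that ad-hoc exploration feels like:

$ ./bin/spark-shell
scala> val logs = sc.textFile("hdfs:///data/web-logs/")
scala> logs.filter(_.contains("ERROR")).count()
scala> logs.filter(_.contains("ERROR")).map(_.split(" ")(0)).countByValue()

Each step gives immediate feedback - no compile / package / submit cycle as with a MapReduce job.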

Spark : Unified Stack

- Spark supports multiple programming models
  - MapReduce-style batch processing
  - Streaming / real-time processing
  - Querying via SQL
  - Machine learning
- All modules are tightly integrated
  - Facilitates rich applications
- Spark can be the only stack you need !
  - No need to run multiple clusters (Hadoop cluster, Storm cluster etc.)

Migrating From Hadoop to Spark

| Functionality       | Hadoop | Spark                                       |
|---------------------|--------|---------------------------------------------|
| Distributed storage | HDFS   | Cloud storage like Amazon S3, or NFS mounts |
| SQL querying        | Hive   | Spark SQL                                   |
| ETL workflow        | Pig    | Spork (Pig on Spark); mix of Spark SQL etc. |
| Machine learning    | Mahout | MLlib                                       |
| NoSQL DB            | HBase  | ???                                         |

Moving From Hadoop to Spark

1. Data size
2. File system
3. SQL
4. ETL
5. Machine learning

Hadoop To Spark

[Diagram : where Spark can help, across batch, real-time and Big Data workloads]

Data Size : You Don't Have Big Data

1) Data Size (T-shirt sizing)

[Scale : < few G | 10 G+ | 100 G+ | 1 TB+ | 100 TB+ | PB+ ; Spark fits the small end, Hadoop the large end]
Image credit : blog.trumpi.co.za

1) Data Size

- Lots of Spark adoption at SMALL-MEDIUM scale
  - Good fit : data might fit in memory !!
  - Hadoop may be overkill
- Applications
  - Iterative workloads (machine learning etc.)
  - Streaming
- Hadoop is still the preferred platform for TB+ data

Next : 2) File System


2) File System

- Hadoop = storage + compute
- Spark = compute only
  - Spark needs a distributed FS
- File system choices for Spark
  - HDFS : Hadoop Distributed File System
    - Reliable
    - Good performance (data locality)
    - Field-tested for PBs of data
  - S3 : Amazon
    - Reliable cloud storage
    - Huge scale
  - NFS : Network File System (shared FS across machines)

Spark File Systems

[Diagram]

File Systems For Spark

|               | HDFS                   | NFS           | Amazon S3        |
|---------------|------------------------|---------------|------------------|
| Data locality | High (best)            | Local enough  | None (ok)        |
| Throughput    | High (best)            | Medium (good) | Low (ok)         |
| Latency       | Low (best)             | Low           | High             |
| Reliability   | Very high (replicated) | Low           | Very high        |
| Cost          | Varies                 | Varies        | $30 / TB / month |

File System Throughput Comparison (HDFS Vs. S3)

- Data : 10 G+ (11.3 G)
  - Each file : ~1+ G (x 10)
  - 400 million records total
  - Partition size : 128 M
  - Same data on HDFS & S3
- Cluster :
  - 8 nodes on Amazon m3.xlarge (4 CPU, 15 G mem, 40 G SSD)
  - Hadoop cluster : latest Hortonworks HDP v2.2
  - Spark : on the same 8 nodes, standalone, v1.2

File System Throughput Comparison (HDFS Vs. S3)

// load the same data set from both file systems (paths elided)
val hdfs = sc.textFile("hdfs:///____/10G/")
val s3 = sc.textFile("s3n://______/10G/")

// count # records - forces a full scan, measuring raw read throughput
hdfs.count()
s3.count()
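To put numbers on the comparison, each count can be wrapped in a simple timer. A minimal sketch (the timed helper is ours, not part of the Spark API):

// Hypothetical helper : run a block and report elapsed seconds
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label : ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

timed("HDFS count") { hdfs.count() }
timed("S3 count")   { s3.count() }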

HDFS Vs. S3

[Chart : throughput comparison]

HDFS Vs. S3 (lower is better)

[Chart : elapsed-time comparison]

HDFS Vs. S3 Conclusions

| HDFS                                    | S3                                                                      |
|-----------------------------------------|-------------------------------------------------------------------------|
| Data locality -> much higher throughput | Data is streamed -> lower throughput                                    |
| Need to maintain a Hadoop cluster       | No Hadoop cluster to maintain -> convenient                             |
| Large data sets (TB+)                   | Good use case : smallish data sets (few gigs), load once, cache, re-use |

Next : 3) SQL


3) SQL in Hadoop / Spark

|                  | Hadoop             | Spark                                            |
|------------------|--------------------|--------------------------------------------------|
| Engine           | Hive               | Spark SQL                                        |
| Language         | HiveQL             | HiveQL, RDD programming in Java / Python / Scala |
| Scale            | Petabytes          | Terabytes ?                                      |
| Interoperability |                    | Can read Hive tables or stand-alone data         |
| Formats          | CSV, JSON, Parquet | CSV, JSON, Parquet                               |

SQL In Hadoop / Spark

- Input : billing records / CDRs
  - Schema : Timestamp (milliseconds), Customer_id (string), Resource_id (int), Qty (int), Cost (int)
  - Sample rows : ts 1000, Phone, qty 10, 10c ; ts 1003, SMS, 4c ; ts 1005, Data, 3M, 5c
- Query : find the top-10 customers
- Data set
  - 10 G+ data
  - 400 million records
  - CSV format

SQL In Hadoop / Spark

- Hive table :

CREATE EXTERNAL TABLE billing (
    ts BIGINT,
    customer_id INT,
    resource_id INT,
    qty INT,
    cost INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs location';

- Hive query (simple aggregate) :

select customer_id, SUM(cost) as total from billing
group by customer_id order by total DESC LIMIT 10;

Hive Query Results

[Screenshot]

Spark + Hive Table

- Spark code to access the Hive table :

// requires a Spark build with Hive support
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val top10 = hiveCtx.sql(
  "select customer_id, SUM(cost) as total from billing " +
  "group by customer_id order by total DESC LIMIT 10")
top10.collect()

Spark SQL Vs. Hive

[Chart : fast on the same HDFS data !]

SQL In Hadoop / Spark : Conclusions

- Spark can readily query Hive tables
  - Speed !
  - Great for exploring / trying out
  - Fast iterative development
- Spark can load data natively (see the sketch below)
  - CSV
  - JSON (schema automatically inferred)
  - Parquet (schema automatically inferred)
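A hedged sketch of the native loading path, using the Spark 1.2-era SQLContext API (file paths are hypothetical; CSV had no built-in loader at the time, so it is parsed by hand):

import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// JSON and Parquet : schema is inferred automatically
val events = sqlCtx.jsonFile("hdfs:///data/events.json")
events.registerTempTable("events")
sqlCtx.sql("SELECT COUNT(*) FROM events").collect()

val metrics = sqlCtx.parquetFile("hdfs:///data/metrics.parquet")

// CSV : split and convert fields manually
val billing = sc.textFile("hdfs:///data/billing.csv")
  .map(_.split(","))
  .map(f => (f(0).toLong, f(1), f(2), f(3).toInt, f(4).toInt))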

Next : 4) ETL In Hadoop / Spark

ETL?

[Diagram : Data 1 -> Data 2 (clean) -> Data 3 -> Data 4]

4) ETL on Hadoop / Spark

|           | Hadoop                  | Spark                                        |
|-----------|-------------------------|----------------------------------------------|
| ETL tools | Pig, Cascading, Oozie   | Native RDD programming (Scala, Java, Python) |
| Pig       | High-level ETL workflow | Spork : Pig on Spark                         |
| Cascading | High level              | Spark-scalding                               |

ETL On Hadoop / Spark

- Pig
  - High-level, expressive data flow language (Pig Latin)
  - Easier to program than Java MapReduce
  - Used for ETL (data cleanup / data prep)
  - Spork : run Pig on Spark
    - As simple as $ pig -x spark ...
    - https://github.com/sigmoidanalytics/spork
- Cascading
  - High-level data flow declarations
  - Many sources (Cassandra / Accumulo / Solr)
  - Spark-Scalding
    - https://github.com/tresata/spark-scalding

ETL On Hadoop / Spark : Conclusions

- Try Spork or spark-scalding
  - Code re-use
  - No re-writing from scratch
- Program RDDs directly (see the sketch below)
  - More flexible
  - Multiple language support : Scala / Java / Python
  - Simpler / faster in some cases
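What programming RDDs directly looks like for a typical cleanup job - a hedged sketch (file layout and validity rules are hypothetical):

// Sketch : Pig-style cleanup expressed as RDD transformations
val raw = sc.textFile("hdfs:///data/billing-raw/")

val cleaned = raw
  .map(_.split(","))
  .filter(_.length == 5)                 // drop malformed rows
  .filter(f => f(4).forall(_.isDigit))   // drop rows with a non-numeric cost
  .map(_.mkString(","))

cleaned.saveAsTextFile("hdfs:///data/billing-clean/")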

5) Machine Learning : Hadoop / Spark

|                      | Hadoop | Spark                 |
|----------------------|--------|-----------------------|
| Tool                 | Mahout | MLlib                 |
| API                  | Java   | Java / Scala / Python |
| Iterative algorithms | Slower | Very fast (in memory) |
| In-memory processing | No     | YES                   |

- Efforts to port Mahout onto Spark - lots of momentum ! (An MLlib sketch follows.)
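A hedged sketch of an iterative MLlib workload - k-means clustering (path, k and iteration count are hypothetical); the cache() is what makes the repeated passes fast:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse numeric CSV into feature vectors, kept in memory because
// k-means makes many passes over the same data
val features = sc.textFile("hdfs:///data/features.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

val model = KMeans.train(features, 10, 20)  // k = 10 clusters, 20 iterations
println(model.clusterCenters.mkString("\n"))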

Spark Is a Better Fit for Iterative Workloads

[Diagram]

Spark Caching !

- Reading data from a remote FS (S3) can be slow
- For small / medium data (10s - 100s of GB), use caching (see the sketch below)
  - Pay the read penalty once
  - Cache
  - Then very high-speed computes (in memory)
  - Recommended for iterative workloads
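A minimal sketch of the pattern (bucket name is hypothetical) - the first action pays the S3 read cost, later actions run from memory:

val records = sc.textFile("s3n://my-bucket/10G/").cache()

records.count()   // 1st pass : reads from S3 and populates the cache
records.count()   // 2nd pass : served from memory, much faster
records.filter(_.contains("ERROR")).count()   // further queries also hit the cache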

Caching Demo !

Caching Results

[Chart : uncached vs. cached run times - cached !]

Spark Caching

- Caching is pretty effective (small / medium data sets)
- Cached data cannot be shared across applications
  - (each application executes in its own sandbox)

Sharing Cached Data

- 1) Spark Job Server
  - Multiplexer : all requests are executed through the same context
  - Provides a web-service interface
- 2) Tachyon (see the sketch below)
  - Distributed in-memory file system
  - Memory is the new disk !
  - Out of the AMP Lab, Berkeley
  - Early stages (very promising)
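From Spark, Tachyon mostly looks like one more file system URI. A hedged sketch (host, port and paths are hypothetical; 19998 is merely Tachyon's conventional default, and the tachyon-client jar must be on the classpath):

// Write an RDD through Tachyon so a second Spark application can read it
// at memory speed, without re-reading from S3 or re-computing
val shared = sc.textFile("s3n://my-bucket/10G/")
shared.saveAsTextFile("tachyon://tachyon-master:19998/shared/10G")

// ... in another application :
val reloaded = sc.textFile("tachyon://tachyon-master:19998/shared/10G")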

Spark Job Server

[Architecture diagram]

Spark Job Server

- Open sourced from Ooyala
- "Spark as a Service" : simple REST interface to launch jobs (see the sketch below)
- Sub-second latency !
- Pre-load jars for even faster spin-up
- Share cached RDDs across requests (NamedRDD) - illustrative pseudocode :

  App1 : ctx.saveRDD("my cached rdd", rdd1)
  App2 : RDD rdd2 = ctx.loadRDD("my cached rdd")

- https://github.com/spark-jobserver/spark-jobserver
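Driving it over REST, as a hedged sketch (host, port, jar and class names are hypothetical; the /jars and /jobs routes follow the project's documentation):

# upload an application jar under the name "billing"
curl --data-binary @billing-job.jar localhost:8090/jars/billing

# launch a job synchronously; results come back in the HTTP response
curl -d "" "localhost:8090/jobs?appName=billing&classPath=demo.Top10Job&sync=true"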

Tachyon + Spark

[Architecture diagram]

Next : New Big Data Applications With Spark

Big Data Applications : Now

- Analysis is done in batch mode (minutes / hours)
- Final results are stored in a real-time data store like Cassandra / HBase
- The results are displayed in a dashboard / web UI
- Doing interactive analysis ????
  - Need special BI tools

With Spark

- Load a data set (gigabytes) from S3 and cache it (one time)
- Super fast (sub-second) queries on the data
- Response time : seconds (just like a web app !) - an end-to-end sketch follows
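Putting the pieces together - a hedged end-to-end sketch (bucket and schema are hypothetical):

import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// One-time load : pull the data set from S3, register it, pin it in memory
val billing = sqlCtx.jsonFile("s3n://my-bucket/billing-json/")
billing.registerTempTable("billing")
sqlCtx.cacheTable("billing")

// Subsequent queries run against the in-memory table,
// giving interactive, web-app-like response times
val top10 = sqlCtx.sql(
  "SELECT customer_id, SUM(cost) AS total FROM billing " +
  "GROUP BY customer_id ORDER BY total DESC LIMIT 10").collect()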

Lessons Learned

- Build sophisticated apps !
  - Web response time (a few seconds) !!
  - In-depth analytics
- Leverage existing libraries in Java / Scala / Python
- "Data analytics as a service"

Final Thoughts

- Already on Hadoop?
  - Try Spark side-by-side
  - Process some data in HDFS
  - Try Spark SQL for Hive tables
- Contemplating Hadoop?
  - Try Spark (standalone)
  - Choose an NFS or S3 file system
  - Take advantage of caching
    - Iterative loads
    - Spark Job Server
    - Tachyon
- Build a new class of big / medium data apps

Thanks !
Sujee Maniyam
sujee@elephantscale.com
http://elephantscale.com
Expert consulting & training in Big Data
(Now offering Spark training)
