Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Matei Zaharia, in collaboration with
UC Berkeley
spark-project.org
What is Spark?
Not a modified version of Hadoop
Separate, fast, MapReduce-like engine
In-memory data storage for very fast iterative queries
General execution graphs and powerful optimizations
Up to 40x faster than Hadoop
What is Shark?
Port of Apache Hive to run on Spark
Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)
Similar speedups of up to 40x
Project History
Spark project started in 2009, open sourced 2010
Shark started summer 2011, alpha April 2012
In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research & others
200+ member meetup, 500+ watchers
This Talk
Spark programming model
User applications
Shark overview
Demo
Next major addition: Streaming Spark
Data Sharing in MapReduce

[Diagram: each iteration (iter. 1, iter. 2, ...) reads its input from HDFS and writes its output back to HDFS, so every step pays an HDFS read and write; likewise query 1, query 2, query 3, ... each re-read the input from HDFS to produce result 1, result 2, result 3.]
Data Sharing in Spark

[Diagram: the input is loaded into distributed memory once (one-time processing); iter. 1, iter. 2, ... and query 1, query 2, query 3, ... then share the in-memory data without repeated HDFS reads.]
Spark Programming Model
Key idea: resilient distributed datasets (RDDs)
Distributed collections of objects that can be cached in memory across cluster nodes
Manipulated through various parallel operators
Automatically rebuilt on failure
Interface
Example: Log Mining

errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
...

[Diagram: the driver ships tasks to workers; each worker caches one block of the messages RDD (Block 1, Block 2, Block 3) and serves queries from its cache.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
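The log-mining pipeline can be imitated on plain Scala collections; a minimal sketch, with made-up log lines standing in for the RDD (only the operator chain matches the slide):

```scala
// Toy stand-in for the log-mining pipeline, run on plain Scala
// collections instead of RDDs (the sample log lines are invented).
object LogMining {
  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "INFO\t10:01\tstartup complete",
      "ERROR\t10:02\tfoo failed",
      "ERROR\t10:03\tbar timed out"
    )
    val errors   = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")(2))  // third tab-separated field
    // On a real RDD, messages.cache() would pin the dataset in cluster
    // memory so the queries below reuse it without re-reading the log.
    println(messages.count(_.contains("foo")))   // prints 1
    println(messages.count(_.contains("bar")))   // prints 1
  }
}
```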
Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split("\t")(2))

[Lineage diagram: HadoopRDD (path = hdfs://...) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(...))]
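Lineage-based recovery can be sketched in a few lines: each toy "RDD" below records only its parent and the function applied, so a lost result is rebuilt by recomputing the chain from stable storage. The type names mirror the diagram; everything else (the loader, the sample lines) is invented for illustration:

```scala
// Toy lineage chain: recovery = recomputation from the source.
sealed trait ToyRDD { def compute: Seq[String] }
case class HadoopRDD(load: () => Seq[String]) extends ToyRDD {
  def compute: Seq[String] = load()            // re-read from stable storage
}
case class FilteredRDD(parent: ToyRDD, f: String => Boolean) extends ToyRDD {
  def compute: Seq[String] = parent.compute.filter(f)
}
case class MappedRDD(parent: ToyRDD, f: String => String) extends ToyRDD {
  def compute: Seq[String] = parent.compute.map(f)
}

object Lineage {
  def main(args: Array[String]): Unit = {
    val base     = HadoopRDD(() => Seq("ERROR\t1\tdisk error", "INFO\t2\tok"))
    val messages = MappedRDD(FilteredRDD(base, _.contains("error")), _.split("\t")(2))
    // No data is stored in the chain itself; compute re-derives it on demand.
    println(messages.compute)  // prints List(disk error)
  }
}
```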
Logistic Regression Performance

[Chart: running time (s), axis 0-4500; Hadoop takes 127 s / iteration.]
Supported Operators
map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
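A couple of these operators have close analogues on plain Scala collections; a hedged sketch (reduceByKey has no direct collection equivalent, so groupBy stands in, and the pair data is made up):

```scala
// Emulating two of the listed operators on ordinary Scala collections.
object Operators {
  def main(args: Array[String]): Unit = {
    val pairs = Seq(("a", 1), ("b", 2), ("a", 3))
    // reduceByKey(_ + _): combine all values that share a key
    val reduced = pairs.groupBy(_._1)
                       .map { case (k, vs) => (k, vs.map(_._2).sum) }
    // join: pair up values across two keyed datasets
    val other  = Seq(("a", "x"), ("b", "y"))
    val joined = for { (k1, v) <- pairs; (k2, w) <- other if k1 == k2 }
                 yield (k1, (v, w))
    println(reduced)
    println(joined)
  }
}
```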
Other Engine Features
General graphs of operators (e.g. map-reduce-reduce)
Hash-based reduces (faster than Hadoop's sort)
PageRank Performance
Controlled data partitioning to lower communication

[Chart: iteration time (s) — Hadoop: 171, Basic Spark: 72, Spark + Controlled Partitioning: 23]
Spark Users
User Applications
In-memory analytics & anomaly detection (Conviva)
Interactive queries on data streams (Quantifind)
Exploratory log analysis (Foursquare)
Traffic estimation w/ GPS data (Mobile Millennium)
Twitter spam classification (Monarch)
...
Conviva GeoReport

[Chart: time (hours), axis 0-20 — Hive: 20, Spark: 0.5]
Mobile Millennium Project
Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu
Shark: Hive on Spark
Motivation
Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes
Scala is good for programmers, but many data users only know SQL
Can we extend Hive to run on Spark?
Hive Architecture

[Diagram: Client (CLI, JDBC) talks to a Driver (SQL parsing → query optimization → physical plan → execution) backed by a Metastore; execution runs as MapReduce jobs over HDFS.]
Shark Architecture

[Diagram: same as Hive — Client (CLI, JDBC), Metastore, Driver (SQL parsing → query optimization → physical plan → execution) — plus a Cache Mgr., with execution running on Spark over HDFS.]
Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Shark employs column-oriented storage using arrays of primitive types
Benefit: similarly compact size to serialized data, but >5x faster to access

Row Storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4)
Column Storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]
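The two layouts for the toy table above can be sketched directly; in the column layout the numeric columns live in primitive arrays, with no per-record object (class and field names here are illustrative, not Shark's):

```scala
// Row vs. column layout for the toy (id, name, score) table.
object ColumnStore {
  // Row storage: one Java object per record.
  final case class Row(id: Int, name: String, score: Double)
  val rows = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

  // Column storage: one array per column; ids and scores are primitive arrays.
  val ids    = Array(1, 2, 3)
  val names  = Array("john", "mike", "sally")
  val scores = Array(4.1, 3.5, 6.4)

  def main(args: Array[String]): Unit = {
    // Scanning one column touches a single contiguous primitive array
    // instead of chasing a pointer per record.
    println(scores.sum == rows.map(_.score).sum)  // prints true
  }
}
```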
Using Shark

CREATE TABLE mydata_cached AS SELECT ...

Can also call from Scala to mix with Spark
Early alpha release at shark.cs.berkeley.edu
Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

[Chart: execution time (secs), axis 0-250 — Shark: 182s, Hive: 207s]
Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

[Chart: execution time (secs), axis 0-500 — Shark (cached): 126s, Shark: 270s, Hive: 447s]
Demo
What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)
Another emerging use case that needs fast data sharing is stream processing
Streaming Spark
Extends Spark to perform streaming computations
Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs
Intermix seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split)
  .map(word => (word, 1))
  .reduceByWindow(5, _ + _)

[Diagram: map and reduceByWindow stages repeated across batches T=1, T=2, ...]
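The windowed word count above can be imitated with plain collections: treat each per-second batch as a Seq, count words per batch, then merge counts across the window. A hedged sketch with invented tweet data (real Streaming Spark keeps these counts as RDDs):

```scala
// Emulate the slide's windowed tweet word count on plain collections.
object WindowedCount {
  def main(args: Array[String]): Unit = {
    // One Seq of tweets per one-second batch (made-up sample data).
    val batches = Seq(
      Seq("Spark is fast", "Shark runs Hive"),
      Seq("spark streams tweets")
    )
    // Per-batch counts, as in map(word => (word, 1)) plus a reduce.
    val perBatch = batches.map { tweets =>
      tweets.flatMap(_.toLowerCase.split(" "))
            .groupBy(identity).map { case (w, ws) => (w, ws.size) }
    }
    // reduceByWindow(_ + _): merge the counts of all batches in the window.
    val windowed = perBatch.flatten
      .groupBy(_._1).map { case (w, cs) => (w, cs.map(_._2).sum) }
    println(windowed("spark"))  // "Spark" + "spark" -> prints 2
  }
}
```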
Conclusion
Spark and Shark speed up your interactive and complex analytics on Hadoop data
Download and docs: www.spark-project.org
[Chart: query time (s) by fraction of data cached — cache disabled, 50%, fully cached; values include 58.1, 40.7, 29.7, 11.5]
Software Stack

[Diagram: Shark (Hive on Spark), Bagel (Pregel on Spark), and Streaming Spark sit on top of Spark, which runs in local mode or on EC2, Apache Mesos, or YARN.]