
Spark

Fast, Interactive, Language-Integrated Cluster Computing


Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma,
Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

www.spark-project.org

UC BERKELEY

Project Goals
Extend the MapReduce model to better support two common classes of analytics apps:
  Iterative algorithms (machine learning, graphs)
  Interactive data mining
Enhance programmability:
  Integrate into the Scala programming language

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Figure: acyclic data flow — Input blocks feed Map tasks, which feed Reduce tasks, which write the Output]

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Figure: the same Input -> Map -> Reduce -> Output data flow]

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures

Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
  Iterative algorithms (machine learning, graphs)
  Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data from stable storage on each query

Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for efficient reuse
Retain the attractive properties of MapReduce:
  Fault tolerance, data locality, scalability
Support a wide range of applications

Outline
Spark programming model
Implementation
Demo
User applications

Programming Model
Resilient distributed datasets (RDDs):
  Immutable, partitioned collections of objects
  Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
  Can be cached for efficient reuse
Actions on RDDs:
  count, reduce, collect, save, ...
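
A minimal sketch of this model (not from the slides): it assumes the `spark` context handle used in the deck's examples, a placeholder HDFS path, and illustrative variable names. Transformations lazily define new RDDs, cache() marks an RDD for in-memory reuse, and actions trigger computation and return results to the driver.

// Assumed example: transformations are lazy, actions run the job
val nums = spark.textFile("hdfs://...")   // RDD of lines from stable storage
                .map(_.toInt)             // transformation: defines a new RDD
                .filter(_ > 0)            // transformation: still nothing computed
                .cache()                  // mark for in-memory reuse

val total = nums.reduce(_ + _)            // action: computes the RDD, returns a value
val size  = nums.count()                  // action: reuses the cached data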

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split("\t")(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver sends tasks to Workers 1-3; each worker reads its input block (Block 1-3), caches its partition of the transformed RDD built from the base RDD, and returns results to the driver]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions

messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split("\t")(2))

[Figure: lineage chain — HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]
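
The same lineage chain written out as a sketch (variable names are illustrative, the path is a placeholder) to show what recovery re-runs:

// Each RDD records how it was derived from its parent, not its data
val file     = spark.textFile("hdfs://...")           // HDFS File
val filtered = file.filter(_.startsWith("ERROR"))     // Filtered RDD
val messages = filtered.map(_.split("\t")(2))         // Mapped RDD

// If a partition of `messages` is lost, only the corresponding HDFS block
// is re-read, and the filter and map functions are re-applied to rebuild it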

Example: Logistic Regression
Goal: find best line separating two sets of points

[Figure: two classes of points, a random initial line, and the target separating line]

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Logistic Regression Performance

[Chart: running time (s) vs. number of iterations — Hadoop takes 127 s / iteration; Spark takes 174 s for the first iteration and 6 s for further iterations]

Spark Applications
  In-memory data mining on Hive data (Conviva)
  Predictive analytics (Quantifind)
  City traffic prediction (Mobile Millennium)
  Twitter spam classification (Monarch)
  Collaborative filtering via matrix factorization

Conviva GeoReport

[Chart: time (hours) — Hive: 20, Spark: 0.5]

Aggregations on many keys w/ same WHERE clause
40× gain comes from:
  Not re-reading unused columns or filtered records
  Avoiding repeated decompression
  In-memory storage of deserialized objects

Frameworks Built on Spark
Pregel on Spark (Bagel):
  Google message passing model for graph computation
  200 lines of code
Hive on Spark (Shark):
  3000 lines of code
  Compatible with Apache Hive
  ML operators in Scala

Implementation
Runs on Apache Mesos to share resources with Hadoop & other apps

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]

Can read from any Hadoop input source (e.g. HDFS)
No changes to Scala compiler

Spark Scheduler
Dryad-like DAGs
Pipelines functions within a stage
Cache-aware work reuse & locality
Partitioning-aware to avoid shuffles (see the sketch below)

[Figure: example DAG over RDDs A-G with groupBy, map, union, and join edges, split into Stages 1-3 at shuffle boundaries; legend: shaded box = cached data partition]
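
An illustrative job (assumed, not from the slides; paths and names are placeholders, and the pair-RDD implicit conversions are assumed to be in scope) showing where the stage boundaries fall: the maps are pipelined within a stage, groupByKey and a join on differently partitioned data introduce shuffles, and union does not.

val a = spark.textFile("hdfs://...logsA").map(line => (line.split("\t")(0), 1))
val b = a.groupByKey()              // shuffle: closes the first stage
val c = spark.textFile("hdfs://...logsC").map(line => (line.split("\t")(0), line))
val d = b.join(c)                   // shuffle unless both sides share a partitioner
val e = d.union(d)                  // no shuffle: partitions are simply concatenated
e.count()                           // action: the scheduler builds and runs the DAG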

Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line
Required two changes:
  Modified wrapper code generation so that each line typed has references to objects for its dependencies
  Distribute generated classes over the network

Demo

Conclusion
Spark provides a simple, efficient, and powerful programming model for a wide range of apps
Download our open source release:

www.spark-project.org
matei@berkeley.edu

Related Work
DryadLINQ, FlumeJava:
  Similar distributed collection API, but cannot reuse datasets efficiently across queries
Relational databases:
  Lineage/provenance, logical logging, materialized views
GraphLab, Piccolo, BigTable, RAMCloud:
  Fine-grained writes similar to distributed shared memory
Iterative MapReduce (e.g. Twister, HaLoop):
  Implicit data sharing for a fixed computation pattern
Caching systems (e.g. Nectar):
  Store data in files, no explicit control over what is cached

Behavior with Not Enough RAM

[Chart: iteration time (s) vs. % of working set in memory, falling from 68.8 s (cache disabled) through 58.1, 40.7, and 29.7 s to 11.5 s (fully cached)]

Fault Recovery Results

[Chart: iteration time (s) for iterations 1-10, comparing a run with no failure against one with a failure; most iterations take 56-59 s, with readings of 119 s and 81 s on the slower iterations]

Spark Operations

Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program):
  collect, reduce, count, save, lookupKey
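
A small usage sketch (an assumed word-count example, not from the slides) combining a few of these operations; `spark` is the deck's context handle and the path is a placeholder.

val counts = spark.textFile("hdfs://...")
  .flatMap(_.split(" "))        // transformation: one word per element
  .map(word => (word, 1))       // transformation: key-value pairs
  .reduceByKey(_ + _)           // transformation: per-key sums
counts.collect()                // action: return the results to the driver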
