ANALYTICS ON SPARK
Joseph Gonzalez, Reynold Xin, Ankur Dave, Daniel Crankshaw,
Michael Franklin, and Ion Stoica
[Figure: analytics pipeline. Panels: XML; Text Table (Title, Body); PageRank; Top 20 Pages (Title, PR); Term-Doc Graph; Topic Model (LDA); Word Topics (Word, Topic); Discussion Table (User, Disc.); Editor Graph; Community Detection; User Community (User, Com.); Community Topic (Topic, Com.)]
PageRank
$$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$$
R[i]: rank of user i
Sum term: weighted sum of neighbors' ranks
Update ranks in parallel
Iterate until convergence
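The update rule above can be sketched in a few lines of plain Scala. This is only an illustration, not GraphX code: the three-vertex graph, the object name `PageRankSketch`, and the choice of weights w_ji = 0.85 / outDegree(j) are assumptions made here so that the iteration converges.

```scala
object PageRankSketch {
  // Hypothetical toy graph. An edge (j, i, w) means vertex j contributes w * R[j] to vertex i.
  // Assumed weights: w = 0.85 / outDegree(j), for the edges 0->1, 0->2, 1->2, 2->0.
  val edges: Seq[(Int, Int, Double)] = Seq(
    (0, 1, 0.425), (0, 2, 0.425), (1, 2, 0.85), (2, 0, 0.85)
  )
  val numVertices = 3

  // One synchronous update of every rank: R[i] = 0.15 + sum of w_ji * R[j] over in-neighbors j.
  def step(ranks: Vector[Double]): Vector[Double] = {
    val sums = Array.fill(numVertices)(0.0)
    for ((j, i, w) <- edges) sums(i) += w * ranks(j)
    sums.map(0.15 + _).toVector
  }

  // Iterate until the largest per-vertex change drops below tol, or maxIter is reached.
  def run(tol: Double = 1e-9, maxIter: Int = 1000): Vector[Double] = {
    var ranks = Vector.fill(numVertices)(1.0)
    var delta = Double.MaxValue
    var iter = 0
    while (delta > tol && iter < maxIter) {
      val next = step(ranks)
      delta = next.zip(ranks).map { case (a, b) => math.abs(a - b) }.max
      ranks = next
      iter += 1
    }
    ranks
  }
}
```

Calling `PageRankSketch.run()` returns one rank per vertex. Every new rank depends only on the old ranks of its neighbors, which is what lets all vertices be updated in parallel within an iteration.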
Model / Alg. State
Computation depends only on the neighbors
Structured Prediction
> Loopy Belief Propagation
> Max-Product Linear Programs
> Gibbs Sampling
Semi-supervised ML
> Graph SSL
> CoEM
Community Detection
> Triangle-Counting
> K-core Decomposition
> K-Truss
Graph Analytics
> PageRank
> Personalized PageRank
> Shortest Path
> Graph Coloring
Classification
> Neural Networks
Graph-Parallel Systems
Google
[Figure: bar chart on an axis from 0 to 1600 with bar values 1340, 354 (Naïve Spark), and 22 (GraphLab)]
Data-Parallel: Table (Row, Row, Row, Row) to Result
Graph-Parallel: Dependency Graph
Inefficient
Extensive data movement and duplication across the network and file system
[Figure: XML input and pipeline stages exchanging data through HDFS]
New API
New Library
Embeds Graph-Parallel model in Spark
GraphX Unified Representation: Table View and Graph View
Property Graph
Vertex Table
Id        Attribute (V)
rxin      (Stu., Berk.)
jegonzal  (PstDoc, Berk.)
franklin  (Prof., Berk.)
istoica   (Prof., Berk.)

Edge Table
SrcId     DstId     Attribute (E)
rxin      jegonzal  Friend
franklin  rxin      Advisor
istoica   franklin  Coworker
franklin  jegonzal  PI
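As a rough illustration of how the two tables relate, the sketch below stores them as ordinary Scala collections and derives a triplets view by joining each edge with the attributes of its endpoints. The object name `PropertyGraphSketch` and the use of `Seq` and `Map` in place of distributed tables are assumptions for the example; this is not the GraphX API.

```scala
object PropertyGraphSketch {
  type Id = String

  // Vertex table: (Id, attribute V). Here V is a (role, affiliation) pair.
  val vertices: Seq[(Id, (String, String))] = Seq(
    "rxin"     -> ("Stu.", "Berk."),
    "jegonzal" -> ("PstDoc", "Berk."),
    "franklin" -> ("Prof.", "Berk."),
    "istoica"  -> ("Prof.", "Berk.")
  )

  // Edge table: (SrcId, DstId, attribute E). Here E is a relationship label.
  val edges: Seq[(Id, Id, String)] = Seq(
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI")
  )

  // Triplets view: each edge joined with the attributes of its source and destination.
  val triplets: Seq[((Id, (String, String)), (Id, (String, String)), String)] = {
    val attr = vertices.toMap
    edges.map { case (src, dst, e) => ((src, attr(src)), (dst, attr(dst)), e) }
  }
}
```

Keeping vertex attributes in one table and edges in another means a vertex attribute is stored once, however many edges touch that vertex. The join is only paid when a triplets view is needed.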
Table Operators
Table (RDD) operators are inherited from Spark:
map
reduce
sample
filter
count
take
groupBy
fold
first
sort
reduceByKey
partitionBy
union
groupByKey
mapWith
join
cogroup
pipe
leftOuterJoin
cross
save
rightOuterJoin
zip
...
Graph Operators
class Graph[V, E] {
  // Construct a graph from a vertex table and an edge table
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
  // Table Views
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]
  // Computation
  def mrTriplets[T](mapF: (Edge[V, E]) => List[(Id, T)],
                    reduceF: (T, T) => T): Graph[T, E]
}
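To make the mrTriplets signature concrete, here is a minimal single-machine sketch using plain Scala collections. It is an assumption-laden stand-in rather than the GraphX implementation: the `Triplet` case class, the curried parameter lists, and returning a `Map[Id, T]` instead of a `Graph[T, E]` are all choices made for this example, and vertices that receive no message are simply absent from the result.

```scala
object MrTripletsSketch {
  type Id = Int

  // Hypothetical triplet type: an edge attribute together with both endpoint (id, attribute) pairs.
  case class Triplet[V, E](src: (Id, V), dst: (Id, V), attr: E)

  // Map every triplet to zero or more (vertex id, message) pairs,
  // then combine all messages addressed to the same vertex with reduceF.
  def mrTriplets[V, E, T](triplets: Seq[Triplet[V, E]])(
      mapF: Triplet[V, E] => List[(Id, T)])(
      reduceF: (T, T) => T): Map[Id, T] =
    triplets
      .flatMap(mapF)
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceF) }

  // Toy graph with edges 1 -> 2, 3 -> 2, and 1 -> 3.
  val triplets: Seq[Triplet[String, Double]] = Seq(
    Triplet((1, "A"), (2, "B"), 1.0),
    Triplet((3, "C"), (2, "B"), 1.0),
    Triplet((1, "A"), (3, "C"), 1.0)
  )

  // In-degree: every edge sends the message 1 to its destination, and messages are summed.
  val inDegree: Map[Id, Int] =
    mrTriplets(triplets)(t => List(t.dst._1 -> 1))(_ + _)
}
```

Here `inDegree` maps vertex 2 to 2 and vertex 3 to 1. Swapping the two functions changes the computation, for example sending `t.attr` and taking `math.max` finds the largest incoming edge attribute per vertex.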
ALS (40)
K-core (51)
Triangle Count (45)
LDA (120)
[Figure: Edges and Triplets views of a small graph. mapF applied to each triplet produces messages A1 and A2; reduceF(A1, A2) combines them into the value for vertex A]
PageRank in GraphX
Graphs are first-class objects
Implementation
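[Figure: distributed graph representation. The graph is partitioned with a 2D Vertex Cut Heuristic into a Vertex Table (RDD), a Routing Table (RDD), and an Edge Table (RDD). Each edge partition has a Mirror Cache of the vertices it references (A, B, C, D in one partition; A, D, E, F in the other)]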
mrTriplets Execution
[Figure: Vertex Table (RDD), Mirror Caches receiving Change updates, Edge Table (RDD) Scan, and a Local Aggregate per edge partition]
Optimizations
[Figure: Edge Table (RDD) partitions with Mirror Caches, one receiving a Change; plot against Iteration with a logarithmic y-axis from 0.1 to 10000]
2. Join Elimination
[Figure: Vertex Table (RDD) and Edge Table (RDD) with Mirror Caches]
[Figure: Communication (MB) per Iteration, y-axis 0 to 14000; series labeled Join Elimination]
[Figure: Scan vs. Indexed, plotted per Iteration, y-axis 0 to 30]
Multi-System Comparison
PageRank
Connected Components
Fault Tolerance
GraphX Training
[Figure: demo pipeline. Panels: Raw Wikipedia XML; Text Table (Title, Body); Hyperlinks; Berkeley subgraph; PageRank; Top 20 Pages (Title, PR)]
Demo
Future Directions
Computation on time-varying graphs
Graph serving
Operating on compressed graphs
Thanks!
http://spark.apache.org/graphx
{jegonzal, rxin, ankurd, crankshaw}@eecs.berkeley.edu