
GRAPHX: UNIFIED GRAPH ANALYTICS ON SPARK
Joseph Gonzalez, Reynold Xin, Ankur Dave, Daniel Crankshaw,
Michael Franklin, and Ion Stoica

*These slides are best viewed in PowerPoint with animation.


1. Motivation for GraphX


2. GraphX Implementation
3. Future Directions

Motivation for GraphX

Graphs are Central to Analytics


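[Figure: a Wikipedia analytics pipeline. Raw Wikipedia XML is parsed into a Text Table (Title, Body), from which three branches are derived: Hyperlinks, then PageRank, then Top 20 Pages (Title, PR); a Term-Doc Graph, then a Topic Model (LDA), then Word Topics (Word, Topic); and a Discussion Table (User, Disc.), then an Editor Graph, then Community Detection, then User Community (User, Com.) and Community Topic (Topic, Com.).]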

PageRank: Identifying Leaders


$$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} \, R[j]$$

R[i] is the rank of user i; the sum is the weighted sum of the neighbors' ranks.

Update ranks in parallel
Iterate until convergence
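The PageRank update can be sketched on a small in-memory edge list in plain Scala, with no Spark dependency. The names `PageRankSketch`, `Edge`, `step`, and `run`, the tolerance, and the iteration cap are assumptions made for illustration; they are not part of GraphX.

```scala
object PageRankSketch {
  // An edge j -> i carrying the weight w_ji from the formula.
  final case class Edge(src: Int, dst: Int, weight: Double)

  // One synchronous update: R[i] = 0.15 + sum over in-neighbors j of w_ji * R[j].
  def step(ranks: Map[Int, Double], edges: Seq[Edge]): Map[Int, Double] = {
    val sums: Map[Int, Double] = edges
      .groupBy(_.dst)
      .map { case (dst, in) => dst -> in.map(e => e.weight * ranks(e.src)).sum }
    ranks.map { case (id, _) => id -> (0.15 + sums.getOrElse(id, 0.0)) }
  }

  // Iterate until no rank moves by more than `tol`, or `maxIter` steps have run.
  @annotation.tailrec
  def run(ranks: Map[Int, Double], edges: Seq[Edge],
          tol: Double = 1e-9, maxIter: Int = 100): Map[Int, Double] = {
    val next = step(ranks, edges)
    val delta = next.foldLeft(0.0) { case (m, (id, r)) =>
      math.max(m, math.abs(r - ranks(id)))
    }
    if (delta <= tol || maxIter <= 1) next else run(next, edges, tol, maxIter - 1)
  }
}
```

Every vertex is recomputed from the previous iteration's ranks, so the per-vertex updates are independent and could run in parallel.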

PageRank
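$$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} \, R[j]$$

[Figure: one update step on a small example graph; the values shown are 0.5, 0.5, 3.65, 1, 1, 1.]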

Single-Source Shortest Path


$$D[i] = 1 + \min_{j \in \mathrm{Nbrs}(i)} D[j]$$

[Figure: example graph with vertex labels 0, 3, 2.]
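A minimal sketch of the shortest-path recurrence for unit-length edges, in plain Scala. It assumes that D is 0 at the source, that Nbrs(i) means the in-neighbors of i, and that every edge endpoint appears in `vertices`; the name `ShortestPathSketch` is hypothetical.

```scala
object ShortestPathSketch {
  // Hop distances from `source` over directed edges (src, dst), each of length 1.
  def run(vertices: Set[Int], edges: Seq[(Int, Int)], source: Int): Map[Int, Double] = {
    val init: Map[Int, Double] =
      vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap

    // One synchronous sweep: D[i] = min(D[i], 1 + min over in-neighbors j of D[j]).
    def sweep(d: Map[Int, Double]): Map[Int, Double] = {
      val candidate = edges.groupBy(_._2).map { case (dst, in) =>
        dst -> (1.0 + in.map(e => d(e._1)).reduce(_ min _))
      }
      d.map { case (v, cur) =>
        v -> math.min(cur, candidate.getOrElse(v, Double.PositiveInfinity))
      }
    }

    // Repeat until a sweep changes nothing.
    @annotation.tailrec
    def loop(d: Map[Int, Double]): Map[Int, Double] = {
      val next = sweep(d)
      if (next == d) d else loop(next)
    }
    loop(init)
  }
}
```

Unreachable vertices keep the distance `Double.PositiveInfinity`.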

The Graph-Parallel Pattern

[Figure: a vertex holding model / algorithm state, linked to its neighbors.]

Computation depends only on the neighbors.

Many Graph-Parallel Algorithms


Collaborative Filtering

> Alternating Least Squares

> Stochastic Gradient Descent

> Tensor Factorization

Structured Prediction

> Loopy Belief Propagation

> Max-Product Linear Programs

> Gibbs Sampling

Semi-supervised ML

> Graph SSL

> CoEM

Community Detection

> Triangle-Counting

> K-core Decomposition

> K-Truss

Graph Analytics

> PageRank

> Personalized PageRank

> Shortest Path

> Graph Coloring

Classification

> Neural Networks

Graph-Parallel Systems


Expose specialized APIs to simplify graph programming.

Exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems.

Specialized API: Pregel


Vertex-Programs interact by sending messages.

Pregel_PageRank(i, messages) :
  // Receive all the messages
  total = 0
  foreach( msg in messages) :
    total = total + msg

  // Update the rank of this vertex
  R[i] = 0.15 + total

  // Send new messages to neighbors
  foreach(j in out_neighbors[i]) :
    Send msg(R[i]) to vertex j

Malewicz et al. [PODC09, SIGMOD10]
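The vertex program can be simulated on one machine as a sequence of supersteps. The sketch below is plain Scala and follows the pseudocode literally, including the unweighted messages; the name `PregelSketch`, the fixed superstep count, and the requirement that every vertex appears as a key of `outNeighbors` are assumptions made for illustration, not the Pregel API itself.

```scala
object PregelSketch {
  // Run `supersteps` rounds of the Pregel_PageRank vertex program.
  // `outNeighbors` maps each vertex to its out-neighbors; every vertex must be a key.
  def pageRank(outNeighbors: Map[Int, Seq[Int]], supersteps: Int): Map[Int, Double] = {
    // Messages addressed to each vertex; the first superstep starts with none.
    var inbox: Map[Int, Seq[Double]] = Map.empty
    var rank: Map[Int, Double] = outNeighbors.map { case (v, _) => v -> 0.0 }

    for (_ <- 1 to supersteps) {
      // Receive: sum the incoming messages and update the rank of each vertex.
      rank = rank.map { case (v, _) => v -> (0.15 + inbox.getOrElse(v, Seq.empty).sum) }
      // Send: each vertex i sends R[i] to every out-neighbor j.
      inbox = outNeighbors.toSeq
        .flatMap { case (i, nbrs) => nbrs.map(j => j -> rank(i)) }
        .groupBy(_._1)
        .map { case (j, msgs) => j -> msgs.map(_._2) }
    }
    rank
  }
}
```

Messages sent in one superstep are only visible in the next one, which is what lets all vertex programs within a superstep run independently.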


PageRank on LiveJournal Graph (69M edges)

Runtime (in seconds, PageRank for 10 iterations)

System          Runtime (s)
Mahout/Hadoop   1340
Naïve Spark     354
GraphLab        22

Spark is 4x faster than Hadoop
GraphLab is 16x faster than Spark

Specialized Systems Miss the Bigger Picture

[Figure: the Wikipedia analytics pipeline: Raw Wikipedia XML, Text Table (Title, Body), Hyperlinks, PageRank, Top 20 Pages (Title, PR), Term-Doc Graph, Topic Model (LDA), Word Topics (Word, Topic), Discussion Table (User, Disc.), Editor Graph, Community Detection, User Community (User, Com.), Community Topic (Topic, Com.).]

Separate Systems to Support Each View


[Figure: two views of the same computation. Data-Parallel systems operate on a Table (rows combined into a result); Graph-Parallel systems operate on a Dependency Graph.]

Having separate systems for each view is difficult to use and inefficient

Difficult to Program and Use


Users must learn, deploy, and manage multiple systems.

This leads to brittle and often complex interfaces.

Inefficient
Extensive data movement and duplication across the network and file system.

[Figure: XML input passing through successive stages, with HDFS between each stage.]

Limited reuse of internal data structures across stages.

Solution: The GraphX Unified Approach

New API: blurs the distinction between Tables and Graphs.

New Library: embeds the Graph-Parallel model in Spark.

Enabling users to easily and efficiently express the entire graph analytics pipeline.

Tables and Graphs are composable views of the same physical data

[Figure: a Table View and a Graph View over the GraphX Unified Representation.]

Each view has its own operators that exploit the semantics of the view to achieve efficient execution.

View a Graph as a Table


[Figure: a property graph over the four people listed below.]

Vertex Table

Id         Attribute (V)
rxin       (Stu., Berk.)
jegonzal   (PstDoc, Berk.)
franklin   (Prof., Berk.)
istoica    (Prof., Berk.)

Edge Table

SrcId      DstId      Attribute (E)
rxin       jegonzal   Friend
franklin   rxin       Advisor
istoica    franklin   Coworker
franklin   jegonzal   PI

Table Operators
Table (RDD) operators are inherited from Spark:
map              reduce         sample
filter           count          take
groupBy          fold           first
sort             reduceByKey    partitionBy
union            groupByKey     mapWith
join             cogroup        pipe
leftOuterJoin    cross          save
rightOuterJoin   zip            ...

Graph Operators
class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])

  // Table Views -----------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]

  // Computation ----------------------------------
  def mrTriplets[T](mapF: (Edge[V, E]) => List[(Id, T)],
                    reduceF: (T, T) => T): Graph[T, E]

  // Convenience functions -------------------------
  def mapV[T](m: (Id, V) => T): Graph[T, E]
  def mapE[T](m: Edge[V, E] => T): Graph[V, T]
  def joinV[T](tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE[T](tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
}

The GraphX Stack (Lines of Code)

PageRank (5) | Connected Comp. (10) | Shortest Path (10) | SVD (40) | ALS (40) | K-core (51) | Triangle Count (45) | LDA (120)
Pregel API (28)
GraphX (3575)
Spark

Triplets Join Vertices and Edges


The triplets operator joins vertices and edges:

SELECT src.Id, dst.Id, src.attr, e.attr, dst.attr
FROM edges AS e
JOIN vertices AS src ON e.srcId = src.Id
JOIN vertices AS dst ON e.dstId = dst.Id

[Figure: example vertices (A, B), an edge, and the resulting triplet.]

The mrTriplets operator sums adjacent triplets:

SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum
FROM triplets AS t GROUP BY t.dstId
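The two queries can be mirrored on ordinary Scala collections, which may help make the semantics concrete. This is a sketch under stated assumptions, not the GraphX implementation: `TripletSketch`, `Triplet`, and both function signatures are hypothetical, the tables are in-memory, and `mrTriplets` follows the SQL form (one mapped value per triplet, grouped by destination).

```scala
object TripletSketch {
  type Id = Long
  final case class Triplet[V, E](src: (Id, V), dst: (Id, V), attr: E)

  // Three-way join of the edge table with the vertex table on srcId and dstId.
  def triplets[V, E](vertices: Map[Id, V], edges: Seq[(Id, Id, E)]): Seq[Triplet[V, E]] =
    for {
      (srcId, dstId, attr) <- edges
      srcAttr <- vertices.get(srcId).toSeq
      dstAttr <- vertices.get(dstId).toSeq
    } yield Triplet((srcId, srcAttr), (dstId, dstAttr), attr)

  // Map each triplet to a value, then reduce the values per destination vertex.
  def mrTriplets[V, E, T](ts: Seq[Triplet[V, E]])(mapF: Triplet[V, E] => T)(
      reduceF: (T, T) => T): Map[Id, T] =
    ts.groupBy(_.dst._1).map { case (dstId, group) =>
      dstId -> group.map(mapF).reduce(reduceF)
    }
}
```

For instance, when the vertex attribute is an age, `mrTriplets(ts)(_.src._2)((a, b) => math.max(a, b))` yields the largest source age per destination vertex. Vertices with no incoming edge are simply absent from the result.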

Map Reduce Triplets


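Map-Reduce for each vertex

[Figure: mapF is applied to the triplets adjacent to a vertex, producing values A1 and A2; reduceF(A1, A2) combines them into a single value for that vertex.]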

Example: Oldest Follower


What is the age of the oldest follower for each user?

val oldestFollowerAge = graph
  .mrTriplets(
    e => (e.dst.id, e.src.age),   // Map
    (a, b) => math.max(a, b)      // Reduce
  )
  .vertices

[Figure: example follower graph whose vertices carry the ages 23, 42, 30, 19, 16, 75.]

PageRank in GraphX

Graphs are first-class objects.

// Load and initialize the graph
val graph = GraphLoader.edgeListFile("hdfs://web.txt")
val prGraph = graph.joinVertices(graph.outDegrees)

// Implement and run PageRank
val pageRank =
  prGraph.pregel(initialMessage = 0.0, iter = 10)(
    (oldV, msgSum) => 0.15 + 0.85 * msgSum,
    triplet => triplet.src.pr / triplet.src.deg,
    (msgA, msgB) => msgA + msgB)

// Get the top 20 pages
pageRank.vertices.top(20)(Ordering.by(_._2)).foreach(println)

Implementation

Distributed Graphs as Tables (RDDs)


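[Figure: a Property Graph is split by a 2D Vertex Cut Heuristic into Edge Table (RDD) partitions Part. 1 and Part. 2. Vertex attributes are stored in a Vertex Table (RDD), and a Routing Table (RDD) lists, for each vertex, the edge partitions (1, 2) that reference it.]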
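One way to picture a 2D vertex cut and its routing table is the simplified sketch below, in plain Scala. It is an assumption-laden illustration rather than GraphX's actual partitioner: `VertexCutSketch`, `partitionOf`, and `routingTable` are hypothetical names, and vertex ids are mapped to grid coordinates with a plain modulo instead of a mixing hash.

```scala
object VertexCutSketch {
  // Place an edge in a side x side grid, side = ceil(sqrt(numParts)):
  // the source id picks the column and the destination id picks the row.
  // All edges of one vertex then fall in a single column plus a single row,
  // so a vertex is mirrored on at most about 2 * side partitions.
  def partitionOf(src: Long, dst: Long, numParts: Int): Int = {
    val side = math.ceil(math.sqrt(numParts.toDouble)).toInt
    val col = java.lang.Math.floorMod(src, side.toLong).toInt
    val row = java.lang.Math.floorMod(dst, side.toLong).toInt
    (col * side + row) % numParts
  }

  // Routing table: for each vertex, the edge partitions that reference it.
  def routingTable(edges: Seq[(Long, Long)], numParts: Int): Map[Long, Set[Int]] =
    edges
      .flatMap { case (s, d) =>
        val p = partitionOf(s, d, numParts)
        Seq(s -> p, d -> p)
      }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }
}
```

The routing table is what lets the vertex table send each vertex attribute only to the edge partitions that need it.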

Caching for Iterative mrTriplets


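[Figure: the Vertex Table (RDD) holds vertices A to F. Each Edge Table (RDD) partition keeps a Mirror Cache of the vertices its edges reference: A, B, C, D in one partition and A, D, E, F in the other.]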

mrTriplets Execution

[Figure: changes from the Vertex Table (RDD) are sent to the Mirror Caches; each Edge Table (RDD) partition is scanned and produces a Local Aggregate.]

Optimizations
1. Incremental Updates to Mirror Caches
2. Join Elimination
3. Index Scanning for Active Sets
4. Local Vertex and Edge Indices
5. Index and Routing Table Reuse

1. Incremental Updates for Mirror Caches



[Figure: only vertices marked Change in the Vertex Table (RDD) are sent to the Mirror Caches of the Edge Table (RDD) partitions; unchanged mirrors are left as they are.]
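The idea of incremental mirror-cache updates can be sketched with plain Scala maps: compute which vertices changed, then ship only those to the partitions that mirror them. All names here (`MirrorCacheSketch`, `changed`, `refresh`) and the in-memory representation are assumptions made for illustration, not GraphX internals.

```scala
object MirrorCacheSketch {
  // Vertices whose attribute differs between two iterations.
  def changed[V](before: Map[Long, V], after: Map[Long, V]): Map[Long, V] =
    after.filter { case (id, attr) => !before.get(id).contains(attr) }

  // Ship only the changed vertices to the edge partitions that mirror them.
  // `routing` maps a vertex id to the partitions holding a mirror of it;
  // `caches` maps a partition id to its current mirror cache.
  def refresh[V](caches: Map[Int, Map[Long, V]],
                 routing: Map[Long, Set[Int]],
                 delta: Map[Long, V]): Map[Int, Map[Long, V]] =
    caches.map { case (part, cache) =>
      val updates = delta.filter { case (id, _) =>
        routing.getOrElse(id, Set.empty[Int]).contains(part)
      }
      part -> (cache ++ updates)
    }
}
```

As an iterative algorithm converges, `changed` returns fewer vertices each round, so the amount of data shipped shrinks accordingly.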

Reduction in Communication Due to Cached Updates

Connected Components on Twitter Graph (1.4B edges)

[Plot: network communication (MB, log scale from 0.1 to 10000) versus iteration (0 to 16).]

Most vertices are within 8 hops of all vertices in their component.

2. Join Elimination
[Figure: the Vertex Table (RDD) with vertices A to F and the Mirror Caches of the Edge Table (RDD) partitions, illustrating a join that ships fewer vertex attributes to the edge partitions.]

Reduction in Communication Due to Join Elimination

UDF byte code inspection for join elimination:
> Identify and bypass joins for unused triplet fields
> Example: PageRank only accesses source attribute

PageRank on Twitter (1.4B edges)

[Plot: communication (MB, 0 to 14000) versus iteration (0 to 20), comparing Three Way Join with Join Elimination.]

Factor of 2 reduction in communication

3. Index Scanning for Active Sets


Connected Components on Twitter (1.4B edges)

[Plot: runtime (seconds, 0 to 30) versus iteration (0 to 16) for two strategies: Scan (scan all edges) and Indexed (index of active edges).]
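The two strategies can be contrasted in a few lines of plain Scala. This is a sketch under stated assumptions: `ActiveSetSketch` and its functions are hypothetical names, edges are indexed by source id only, and the switching `threshold` is an invented placeholder rather than a value from the talk.

```scala
object ActiveSetSketch {
  type Edge = (Long, Long) // (srcId, dstId)

  // Full scan: touch every edge and keep those whose source is active.
  def scan(edges: Seq[Edge], active: Set[Long]): Seq[Edge] =
    edges.filter { case (src, _) => active.contains(src) }

  // Index built once per edge partition: source id -> edges leaving it.
  def buildIndex(edges: Seq[Edge]): Map[Long, Seq[Edge]] = edges.groupBy(_._1)

  // Indexed lookup: touch only the edges of active sources.
  def indexed(index: Map[Long, Seq[Edge]], active: Set[Long]): Seq[Edge] =
    active.toSeq.flatMap(src => index.getOrElse(src, Seq.empty))

  // Hypothetical switching rule: use the index only when few vertices are active.
  def activeEdges(edges: Seq[Edge], index: Map[Long, Seq[Edge]],
                  active: Set[Long], threshold: Double = 0.2): Seq[Edge] =
    if (active.size < threshold * index.size) indexed(index, active)
    else scan(edges, active)
}
```

Both paths return the same edges; they differ only in how many edges are touched, which matters most in late iterations when the active set is small.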

Additional Query Optimizations


Indexing and Bitmaps:
> To accelerate joins across graphs
> To efficiently construct sub-graphs

Substantial Index and Data Reuse:
> Split Hash Table: reuse key sets across hash tables
> Reuse routing tables across graphs and sub-graphs
> Reuse edge adjacency information and indices

Multi-System Comparison
[Plots: PageRank and Connected Components on the uk-2007-05 web graph (3.7B edges).]

Fault Tolerance

GraphX Training
[Figure: training pipeline. Raw Wikipedia XML is parsed into a Text Table (Title, Body); Hyperlinks are extracted and restricted to a Berkeley subgraph; PageRank then produces the Top 20 Pages (Title, PR).]

Demo

Future Directions
Computation on time-varying graphs
Graph serving
Operating on compressed graphs

Thanks!

http://spark.apache.org/graphx

{jegonzal, rxin, ankurd, crankshaw}@eecs.berkeley.edu
