Graphx@Sparksummit 2014 07

GRAPHX: UNIFIED GRAPH
ANALYTICS ON SPARK
Joseph Gonzalez, Reynold Xin, Ankur Dave, Daniel Crankshaw,
Michael Franklin, and Ion Stoica
*These slides are best viewed in PowerPoint with animation.

1. Motivation for GraphX

2. GraphX Implementation
3. Future Directions
Motivation for GraphX
Graphs are Central to Analytics

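[Figure: analytics pipeline over raw Wikipedia XML. A text table (title, body) yields hyperlinks, which feed PageRank to produce the top 20 pages (title, PR); a term-doc graph feeds a topic model (LDA) to produce word topics (word, topic); a discussion table (user, disc.) yields an editor graph, which feeds community detection to produce user communities (user, com.) and community topics (topic, com.).]
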
PageRank: Identifying Leaders

$$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} \, R[j]$$

R[i] is the rank of user i; the sum is the weighted sum of the neighbors' ranks.

Update ranks in parallel
Iterate until convergence
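
The update rule can be tried on a toy graph. The following is a minimal sketch in plain Scala (no Spark), assuming a hypothetical three-vertex graph and weights w_ji = 0.85 / outDegree(j); the slide does not fix the weights, and the 0.85 factor is borrowed from the GraphX PageRank listing later in the deck. The names `PageRankSketch`, `step`, `run`, and `tol` are illustrative only.

```scala
object PageRankSketch {
  // Directed edges (src -> dst) of a hypothetical 3-vertex graph.
  val edges: Seq[(Int, Int)] = Seq((1, 2), (1, 3), (2, 3), (3, 1))
  val vertices: Seq[Int] = edges.flatMap { case (s, d) => Seq(s, d) }.distinct.sorted

  // Out-degree of every source vertex, used to define the weights w_ji.
  val outDeg: Map[Int, Int] =
    edges.groupBy(_._1).map { case (v, es) => v -> es.size }

  // Assumed weight of an edge leaving j: 0.85 / outDegree(j).
  def w(j: Int): Double = 0.85 / outDeg(j)

  // One synchronous update: R[i] = 0.15 + sum over in-neighbors j of w_ji * R[j].
  def step(r: Map[Int, Double]): Map[Int, Double] =
    vertices.map { i =>
      val sum = edges.collect { case (j, `i`) => w(j) * r(j) }.sum
      i -> (0.15 + sum)
    }.toMap

  // Iterate until no rank moves by more than tol (or maxIter is reached).
  def run(tol: Double = 1e-9, maxIter: Int = 1000): Map[Int, Double] = {
    var r = vertices.map(v => v -> 1.0).toMap
    var delta = Double.MaxValue
    var iter = 0
    while (delta > tol && iter < maxIter) {
      val next = step(r)
      delta = vertices.map(v => math.abs(next(v) - r(v))).max
      r = next
      iter += 1
    }
    r
  }

  def main(args: Array[String]): Unit =
    run().toSeq.sortBy(_._1).foreach { case (v, rank) => println(f"$v%d -> $rank%.4f") }
}
```

Every vertex is recomputed from the previous iteration's ranks, which is what makes the update safe to run in parallel.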

PageRank
[Figure: the update for R[i] applied to an example graph; the values shown on the graph are 0.5, 0.5, 3.65, 1, 1, 1.]
Single-Source Shortest Path

$$D[i] = 1 + \min_{j \in \mathrm{Nbrs}(i)} D[j]$$

[Figure: example graph with vertices labeled 0, 3, and 2.]
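
As a rough illustration of this rule, the sketch below repeats the update on a small unit-weight graph until no distance changes. It assumes the source vertex is pinned at distance 0 and that edges are undirected; `ShortestPathSketch` and `distances` are hypothetical names, not part of any library.

```scala
object ShortestPathSketch {
  // Unit-weight distances from `source`, found by repeating the update
  // D[i] = 1 + min over neighbors j of D[j] until nothing changes.
  def distances(edges: Seq[(Int, Int)], source: Int): Map[Int, Double] = {
    val vertices = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
    // Assumption: edges are undirected, so each endpoint is a neighbor of the other.
    val nbrs: Map[Int, Seq[Int]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    // The source starts at 0; every other vertex starts at infinity.
    var d: Map[Int, Double] =
      vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap
    var changed = true
    while (changed) {
      val next = d.map { case (i, di) =>
        if (i == source) i -> 0.0
        else i -> math.min(di, 1.0 + nbrs(i).map(j => d(j)).min)
      }
      changed = next != d
      d = next
    }
    d
  }

  def main(args: Array[String]): Unit =
    distances(Seq((1, 2), (2, 3), (3, 4), (1, 5)), source = 1)
      .toSeq.sortBy(_._1).foreach(println)
}
```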
The Graph-Parallel Pattern
Model / Alg. State
Computation depends only on the neighbors
Many Graph-Parallel Algorithms

Collaborative Filtering

> Alternating Least Squares

> Stochastic Gradient Descent

> Tensor Factorization

Structured Prediction

> Loopy Belief Propagation

> Max-Product Linear Programs

> Gibbs Sampling

Semi-supervised ML

> Graph SSL

> CoEM

Community Detection

> Triangle-Counting

> K-core Decomposition

> K-Truss

Graph Analytics

> PageRank

> Personalized PageRank

> Shortest Path

> Graph Coloring

Classification

> Neural Networks

Graph-Parallel Systems
Expose specialized APIs to simplify graph programming.

Exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems.

Specialized API: Pregel

Vertex-Programs interact by sending messages.

Pregel_PageRank(i, messages) :
  // Receive all the messages
  total = 0
  foreach (msg in messages) :
    total = total + msg

  // Update the rank of this vertex
  R[i] = 0.15 + total

  // Send new messages to neighbors
  foreach (j in out_neighbors[i]) :
    Send msg(R[i]) to vertex j
Malewicz et al. [PODC09, SIGMOD10]
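
The vertex-program idea can be mimicked on plain Scala collections. This is only a sketch under stated assumptions, not Pregel's or GraphX's API: `PregelSketch.run` and its parameters are hypothetical, every vertex must appear as a key of `outNbrs`, and each message is assumed to carry 0.85 * R[j] / outDegree(j) so that the received total matches the weighted sum in the PageRank formula. Each superstep sends from the current state and then updates, which is the slide's receive, update, send loop shifted by half a step.

```scala
object PregelSketch {
  // Runs `supersteps` rounds of a Pregel-style computation on plain collections.
  //   vprog: (vertex id, old state, combined incoming message) => new state
  //   send : (source state, source out-degree) => message for each out-neighbor
  //   merge: combines two messages addressed to the same vertex
  def run[S, M](
      outNbrs: Map[Int, Seq[Int]],
      init: S,
      zero: M,
      supersteps: Int)(
      vprog: (Int, S, M) => S,
      send: (S, Int) => M,
      merge: (M, M) => M): Map[Int, S] = {
    var state: Map[Int, S] = outNbrs.keys.map(v => v -> init).toMap
    for (_ <- 1 to supersteps) {
      // Send phase: every vertex emits one message per out-neighbor.
      val msgs: Seq[(Int, M)] = state.toSeq.flatMap { case (i, s) =>
        outNbrs(i).map(j => j -> send(s, outNbrs(i).size))
      }
      // Combine messages per destination vertex.
      val inbox: Map[Int, M] =
        msgs.groupBy(_._1).map { case (j, ms) => j -> ms.map(_._2).reduce(merge) }
      // Receive phase: each vertex folds its combined inbox into a new state.
      state = state.map { case (i, s) => i -> vprog(i, s, inbox.getOrElse(i, zero)) }
    }
    state
  }

  // PageRank as a vertex program: R[i] = 0.15 + total of incoming messages.
  def pageRank(outNbrs: Map[Int, Seq[Int]], supersteps: Int): Map[Int, Double] =
    run[Double, Double](outNbrs, 1.0, 0.0, supersteps)(
      (_, _, total) => 0.15 + total,
      (r, deg) => 0.85 * r / deg,
      _ + _)

  def main(args: Array[String]): Unit =
    pageRank(Map(1 -> Seq(2, 3), 2 -> Seq(3), 3 -> Seq(1)), supersteps = 10)
      .toSeq.sortBy(_._1).foreach(println)
}
```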

PageRank on LiveJournal Graph (69M edges)

Runtime (in seconds, PageRank for 10 iterations)

System          Runtime (s)
Mahout/Hadoop   1340
Naïve Spark     354
GraphLab        22

Spark is 4x faster than Hadoop

GraphLab is 16x faster than Spark

Specialized Systems Miss the Bigger Picture
[Figure: the Wikipedia analytics pipeline from "Graphs are Central to Analytics", shown again: text table, hyperlinks, PageRank, and top 20 pages; term-doc graph, topic model (LDA), and word topics; discussion table, editor graph, community detection, user communities, and community topics.]

Separate Systems to Support Each View

[Figure: data-parallel systems operate on a table of rows and produce a result; graph-parallel systems operate on a dependency graph.]

Having separate systems for each view is difficult to use and inefficient
Difficult to Program and Use

Users must learn, deploy, and manage multiple systems

Leads to brittle and often complex interfaces

Inefficient
Extensive data movement and duplication across the network and file system

[Figure: pipeline stages exchanging XML and intermediate data through HDFS.]

Limited reuse of internal data structures across stages

Solution: The GraphX Unified Approach
New API: blurs the distinction between Tables and Graphs

New Library: embeds the Graph-Parallel model in Spark

Enabling users to easily and efficiently express the entire graph analytics pipeline

Tables and Graphs are composable views of the same physical data

[Figure: the Table View and the Graph View are both views of the GraphX Unified Representation.]

Each view has its own operators that exploit the semantics of the view to achieve efficient execution

View a Graph as a Table

[Figure: the property graph whose vertices and edges are listed in the two tables below.]

Vertex Table

Id         Attribute (V)
rxin       (Stu., Berk.)
jegonzal   (PstDoc, Berk.)
franklin   (Prof., Berk.)
istoica    (Prof., Berk.)

Edge Table

SrcId      DstId      Attribute (E)
rxin       jegonzal   Friend
franklin   rxin       Advisor
istoica    franklin   Coworker
franklin   jegonzal   PI


Table Operators
Table (RDD) operators are inherited from Spark:
map
reduce
sample
filter
count
take
groupBy
fold
first
sort
reduceByKey
partitionBy
union
groupByKey
mapWith
join
cogroup
pipe
leftOuterJoin
cross
save
rightOuterJoin
zip
...
Graph Operators
class Graph [ V, E ] {
  def Graph(vertices: Table[ (Id, V) ],
            edges: Table[ (Id, Id, E) ])
  // Table Views -----------------
  def vertices: Table[ (Id, V) ]
  def edges: Table[ (Id, Id, E) ]
  def triplets: Table [ ((Id, V), (Id, V), E) ]
  // Computation ----------------------------------
  def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],
                 reduceF: (T, T) => T): Graph[T, E]
  // Convenience functions ------------------------------
  def mapV(m: (Id, V) => T ): Graph[T,E]
  def mapE(m: Edge[V,E] => T ): Graph[V,T]
  def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]
  def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V,E] => Boolean): Graph[V,E]
}
The GraphX Stack (Lines of Code)

Algorithms:
PageRank (5)
Connected Comp. (10)
Shortest Path (10)
SVD (40)
ALS (40)
K-core (51)
Triangle Count (45)
LDA (120)

Layers, top to bottom:
Pregel API (28)
GraphX (3575)
Spark


Triplets Join Vertices and Edges

The triplets operator joins vertices and edges:
SELECT src.Id, dst.Id, src.attr, e.attr, dst.attr
FROM edges AS e
JOIN vertices AS src ON e.srcId = src.Id
JOIN vertices AS dst ON e.dstId = dst.Id
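
The same join can be written over ordinary Scala collections. The sketch below is an illustration only, not the GraphX implementation; `TripletsSketch`, `Triplet`, and `triplets` are hypothetical names, and the sample data is taken from the vertex and edge tables on the "View a Graph as a Table" slide.

```scala
object TripletsSketch {
  final case class Triplet[V, E](srcId: String, dstId: String, srcAttr: V, attr: E, dstAttr: V)

  // Joins every edge with the attributes of its source and destination vertex,
  // mirroring the SQL above. Edges with a missing endpoint are dropped (inner join).
  def triplets[V, E](
      vertices: Map[String, V],
      edges: Seq[(String, String, E)]): Seq[Triplet[V, E]] =
    edges.flatMap { case (src, dst, attr) =>
      for {
        srcAttr <- vertices.get(src)
        dstAttr <- vertices.get(dst)
      } yield Triplet(src, dst, srcAttr, attr, dstAttr)
    }

  // Sample data from the "View a Graph as a Table" slide.
  val sampleVertices: Map[String, (String, String)] = Map(
    "rxin" -> ("Stu.", "Berk."),
    "jegonzal" -> ("PstDoc", "Berk."),
    "franklin" -> ("Prof.", "Berk."),
    "istoica" -> ("Prof.", "Berk."))

  val sampleEdges: Seq[(String, String, String)] = Seq(
    ("rxin", "jegonzal", "Friend"),
    ("franklin", "rxin", "Advisor"),
    ("istoica", "franklin", "Coworker"),
    ("franklin", "jegonzal", "PI"))

  def main(args: Array[String]): Unit =
    triplets(sampleVertices, sampleEdges).foreach(println)
}
```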
[Figure: example with vertices A and B, the edges between them, and the resulting triplets.]

The mrTriplets operator aggregates adjacent triplets: it maps each triplet and reduces the results per destination vertex.

SELECT t.dstId, reduceUDF( mapUDF(t) ) AS sum
FROM triplets AS t GROUP BY t.dstId
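
A collection-based sketch of the same map-then-reduce pattern follows. It is an assumption-laden illustration rather than GraphX code: `MrTripletsSketch` and `inDegrees` are hypothetical names, and the result is returned as a plain `Map` from vertex id to aggregate instead of a new graph.

```scala
object MrTripletsSketch {
  final case class Triplet[V, E](srcId: Int, dstId: Int, srcAttr: V, attr: E, dstAttr: V)

  // mapF turns each triplet into zero or more (vertex id, message) pairs;
  // reduceF merges all messages addressed to the same vertex.
  def mrTriplets[V, E, T](triplets: Seq[Triplet[V, E]])(
      mapF: Triplet[V, E] => Seq[(Int, T)],
      reduceF: (T, T) => T): Map[Int, T] =
    triplets
      .flatMap(mapF)
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceF) }

  // Example use: in-degree of every vertex that has at least one incoming edge.
  def inDegrees[V, E](triplets: Seq[Triplet[V, E]]): Map[Int, Int] =
    mrTriplets[V, E, Int](triplets)(t => Seq(t.dstId -> 1), _ + _)

  def main(args: Array[String]): Unit = {
    val ts = Seq(
      Triplet(1, 3, "a", 1.0, "c"),
      Triplet(2, 3, "b", 1.0, "c"),
      Triplet(3, 1, "c", 1.0, "a"))
    println(inDegrees(ts))
  }
}
```

Vertices that receive no message are simply absent from the result, which matches the GROUP BY semantics of the SQL form.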
Map Reduce Triplets

Map-Reduce for each vertex
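[Figure: mapF is applied to each edge adjacent to vertex A, producing messages A1 and A2; reduceF(A1, A2) combines them into a single value for A.]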
Example: Oldest Follower

What is the age of the oldest follower for each user?

val oldestFollowerAge = graph
  .mrTriplets(
    e => (e.dst.id, e.src.age),   // Map
    (a, b) => math.max(a, b)      // Reduce
  )
  .vertices

[Figure: example follower graph whose vertices carry the ages 23, 42, 30, 19, 16, and 75.]


PageRank in GraphX

Graphs are first-class objects

// Load and initialize the graph
val graph = GraphLoader.edgeListFile("hdfs://web.txt")
val prGraph = graph.joinVertices(graph.outDegrees)
// Implement and run PageRank
val pageRank =
  prGraph.pregel(initialMessage = 0.0, iter = 10)(
    (oldV, msgSum) => 0.15 + 0.85 * msgSum,
    triplet => triplet.src.pr / triplet.src.deg,
    (msgA, msgB) => msgA + msgB)
// Get the top 20 pages
pageRank.vertices.top(20)(Ordering.by(_._2)).foreach(println)
Implementation
Distributed Graphs as Tables (RDDs)

[Figure: a property graph is split by a 2D vertex-cut heuristic into an edge table (RDD) with partitions Part. 1 and Part. 2, a vertex table (RDD), and a routing table (RDD) that lists edge partitions 1 and 2 for the vertices.]


Caching for Iterative mrTriplets

[Figure: the vertex table (RDD) holds vertices A to F; each edge table (RDD) partition keeps a mirror cache of the vertices it references (A, B, C, D in one partition and A, D, E, F in the other).]


mrTriplets Execution

[Figure: changed vertices are sent from the vertex table (RDD) to the mirror caches; each edge table (RDD) partition is scanned and computes a local aggregate.]


Optimizations
1. Incremental Updates to Mirror Caches
2. Join Elimination
3. Index Scanning for Active Sets
4. Local Vertex and Edge Indices
5. Index and Routing Table Reuse
1. Incremental Updates for Mirror Caches

[Figure: only the vertices marked "Change" in the vertex table (RDD) are sent to the mirror caches of the edge table (RDD) partitions; unchanged mirrors are reused.]

Reduction in Communication Due to Cached Updates
Connected Components on Twitter Graph (1.4B edges)

[Plot: network communication (MB, log scale from 0.1 to 10000) per iteration (0 to 16).]

Most vertices are within 8 hops of all vertices in their comp.
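
To see why shipping only changed vertices helps, the sketch below runs connected components by min-label propagation on a hypothetical toy graph and records how many vertices change in each iteration. The count is only a proxy for the data that would have to be sent to mirror caches, and the names `IncrementalUpdateSketch` and `run` are illustrative, not GraphX internals.

```scala
object IncrementalUpdateSketch {
  // Min-label propagation for connected components. Returns the final labels
  // and, per iteration, how many vertices changed (the only ones whose new
  // value would need to be shipped to the mirror caches).
  def run(vertices: Seq[Int], edges: Seq[(Int, Int)]): (Map[Int, Int], Seq[Int]) = {
    var label: Map[Int, Int] = vertices.map(v => v -> v).toMap
    var changedPerIter = Vector.empty[Int]
    var changed: Set[Int] = vertices.toSet
    while (changed.nonEmpty) {
      // Each edge offers each endpoint's current label to the other endpoint.
      val offers: Seq[(Int, Int)] = edges.flatMap { case (a, b) =>
        Seq(a -> label(b), b -> label(a))
      }
      val best: Map[Int, Int] =
        offers.groupBy(_._1).map { case (v, os) => v -> os.map(_._2).min }
      // Keep only the offers that actually lower a vertex's label.
      val updates = best.filter { case (v, l) => l < label(v) }
      label = label ++ updates
      changed = updates.keySet
      if (changed.nonEmpty) changedPerIter = changedPerIter :+ changed.size
    }
    (label, changedPerIter)
  }

  def main(args: Array[String]): Unit = {
    val (labels, changes) =
      run(1 to 7, Seq((1, 2), (2, 3), (3, 4), (4, 5), (6, 7)))
    println(labels.toSeq.sortBy(_._1))
    println(changes)
  }
}
```

On this toy input the number of changed vertices shrinks every iteration, which is the same qualitative trend the plot reports for the Twitter graph.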

2. Join Elimination
[Figure: vertex table (RDD) with vertices A to F and the mirror caches of the edge table (RDD) partitions.]


Reduction in Communication Due to Join Elimination
UDF byte code inspection for join elimination:
> Identify and bypass joins for unused triplet fields
> Example: PageRank only accesses source attribute

[Plot: PageRank on Twitter (1.4B edges); communication (MB, 0 to 14000) per iteration (0 to 20) for the three-way join versus join elimination.]

Factor of 2 reduction in communication

3. Index Scanning for Active Sets

Connected Components on Twitter (1.4B edges)

[Plot: runtime (seconds, 0 to 30) per iteration (0 to 16) for the "Scan" and "Indexed" strategies, with regions annotated "Scan All Edges" and "Index of Active Edges".]
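
The difference between the two strategies can be sketched as follows. This is a simplified illustration on plain collections: the index is a map from source id to edges, and the switching rule in `activeEdges` (including the `threshold` parameter and its default) is a hypothetical policy introduced here, not something the slides specify.

```scala
object ActiveSetSketch {
  type Edge = (Int, Int)

  // Full scan: test every edge against the active set.
  def scanAll(edges: Seq[Edge], active: Set[Int]): Seq[Edge] =
    edges.filter { case (src, _) => active.contains(src) }

  // Index on source id, built once and reused across iterations.
  def buildIndex(edges: Seq[Edge]): Map[Int, Seq[Edge]] = edges.groupBy(_._1)

  // Indexed scan: touch only the edges of active source vertices.
  def scanIndexed(index: Map[Int, Seq[Edge]], active: Set[Int]): Seq[Edge] =
    active.toSeq.flatMap(v => index.getOrElse(v, Seq.empty))

  // Hypothetical policy: use the index only when few vertices are active.
  def activeEdges(
      edges: Seq[Edge],
      index: Map[Int, Seq[Edge]],
      active: Set[Int],
      numVertices: Int,
      threshold: Double = 0.8): Seq[Edge] =
    if (active.size < threshold * numVertices) scanIndexed(index, active)
    else scanAll(edges, active)

  def main(args: Array[String]): Unit = {
    val edges = Seq((1, 2), (1, 3), (2, 3), (3, 1))
    val index = buildIndex(edges)
    println(activeEdges(edges, index, Set(1), numVertices = 3))
  }
}
```

Both paths return the same edges; the indexed path does work proportional to the active set rather than to the whole edge table, which matters in late iterations when few vertices are still active.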

Additional Query Optimizations

Indexing and Bitmaps:
> To accelerate joins across graphs
> To efficiently construct sub-graphs

Substantial Index and Data Reuse:
> Split Hash Table: reuse key sets across hash tables
> Reuse routing tables across graphs and sub-graphs
> Reuse edge adjacency information and indices
Multi-System Comparison
[Figure: multi-system comparison of PageRank and Connected Components on the uk-2007-05 web graph (3.7B edges).]
Fault Tolerance
GraphX Training
[Figure: training pipeline from raw Wikipedia XML to a text table (title, body), to hyperlinks, to the Berkeley subgraph, to PageRank, to the top 20 pages (title, PR).]


Demo
Future Directions
Computation on time-varying graphs
Graph serving
Operating on compressed graphs
Thanks!
http://spark.apache.org/graphx

{jegonzal, rxin, ankurd, crankshaw}@eecs.berkeley.edu
