
Distributed k-median and k-means Clustering

Yingyu Liang

Joint work with Maria-Florina Balcan and Steven Ehrlich


Georgia Institute of Technology

November 26, 2013


Outline

1 Introduction

2 Non-distributed Algorithms

3 Distributed Clustering
Clustering Comes Up Everywhere

Cluster news articles or web pages by topic

Cluster images by who is in them


Clustering Task

Partition objects into groups so that


those in the same group are similar
those in different groups are not similar

k-median/k-means Clustering

A set P of N objects, represented as points in R^d

Find centers x = {x_1, . . . , x_k} to minimize Σ_{p∈P} cost(p, x)
Each point p is assigned to its nearest center
k-median cost: distance to the nearest center
k-means cost: squared distance to the nearest center
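As a concrete reference, here is a minimal sketch of the two cost objectives in Python/numpy (the helper name clustering_cost is ours, not from the talk):

import numpy as np

def clustering_cost(P, centers, objective="means"):
    """Cost of assigning each point in P (N x d) to its nearest center.

    objective="median": sum of distances (k-median)
    objective="means":  sum of squared distances (k-means)
    """
    # Pairwise distances from every point to every center: N x k
    dists = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)          # distance to nearest center
    if objective == "median":
        return nearest.sum()
    return (nearest ** 2).sum()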
Outline

1 Introduction

2 Non-distributed Algorithms
Lloyd's Method
Local Search

3 Distributed Clustering
Lloyd's Method for k-means

Lloyd's Method for k-means

1 Initialize a set of centers {x_1, . . . , x_k}
2 Assign each point to its nearest center
3 Update each center to the mean of the points assigned to it
4 Repeat Assign-Update until converged
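A minimal runnable sketch of Lloyd's method, assuming numpy and random initialization from the input points (the slides leave the initialization strategy unspecified):

def lloyds(P, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct input points as centers (one common choice)
    centers = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to its nearest center
        dists = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update each center to the mean of its assigned points
        new_centers = np.array([
            P[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4. Repeat Assign-Update until converged
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels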
Lloyd's Method for k-means

[Figure slides: the Initialization, Assign, and Update steps illustrated on example data]
Local Search: Single Swap

Single Swap

1 Initialize a set of centers x = {x_1, . . . , x_k}
2 While there exist x_i and p ∈ P with cost(P, x - x_i + p) < cost(P, x):
  x ← x - x_i + p
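A minimal sketch of single-swap local search, reusing the clustering_cost helper above; note that this naive version accepts any improvement, whereas analyzed variants require each swap to improve the cost by a sufficiently large factor to bound the running time:

def single_swap(P, k, cost_fn, seed=0):
    rng = np.random.default_rng(seed)
    centers = list(rng.choice(len(P), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        current = cost_fn(P, P[centers])
        # Try swapping each center x_i for each non-center point p
        for i in range(k):
            for p in range(len(P)):
                if p in centers:
                    continue
                trial = centers.copy()
                trial[i] = p
                if cost_fn(P, P[trial]) < current:
                    centers = trial     # accept the improving swap
                    improved = True
                    break
            if improved:
                break
    return P[centers]

Each accepted swap strictly decreases the cost, so the loop terminates.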
Single Swap

[Figure slides: initialization followed by two swap iterations illustrated on example data]
Outline

1 Introduction

2 Non-distributed Algorithms

3 Distributed Clustering
Global Summary Construction
Communication on General Topologies
Experiments
Modern Challenge: Distributed Data

Distributed databases
Images and videos on the Internet
Sensor networks
...
Distributed Clustering

Communication graph G with n nodes and m edges:


an edge indicates that the two nodes can communicate
Global data P is divided into local data sets P1 , . . . , Pn

Goal: efficient distributed algorithm for k-median/k-means


with guarantees for clustering cost and communication cost
Related Work

1 Direct adaptation of non-distributed algorithms,
e.g. Lloyd's method [Forman et al., 2000; Datta et al., 2005]
no consideration of the communication cost
2 Transmitting summaries of local data to a central coordinator
[Januzaj et al., 2003; Kargupta et al., 2001]
no guarantee on the clustering cost
not for general communication topologies
Our Results

A distributed algorithm for k-median/k-means that

1 produces a (1 + ε)α-approximation, using any α-approximation
non-distributed algorithm as a subroutine
2 has total communication cost O(kd + nk) on the star graph
3 has total communication cost on general topologies that is
independent of the number of points N
linear in the number of clusters k and the dimension d
linear in the number of nodes n and the number of edges m
Our Results

Two stages of our distributed algorithm

1 Construct a global summary of the data
each node constructs a local portion of the summary
2 Compute an approximate solution on the summary
each node broadcasts its local portion
Coreset

Weighted points whose cost approximates that of the original data

Coreset [Har-Peled and Mazumdar, 2004]

An ε-coreset for a set of points P with respect to a cost objective
function is a set of points D and a set of weights w : D → R such
that for any set of centers x,

(1 - ε) cost(P, x) ≤ Σ_{p∈D} w_p cost(p, x) ≤ (1 + ε) cost(P, x).
Coreset Construction in the Non-distributed Setting

Coreset construction [Feldman and Langberg, 2011]

1 Compute a constant approximation solution A
2 Sample points S with probability proportional to cost(p, A)
3 Let the coreset D = S ∪ A (with weights specified later)
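A sketch of the sampling step in Python/numpy; the weights shown follow the inverse-probability scheme from the sampling lemma later in the talk, while the full Feldman-Langberg weighting of the centers in A is a detail the slides defer:

def sample_coreset_points(P, A, t, cost, seed=0):
    """Sample t points from P proportional to their cost under solution A.

    cost(p, A) is a user-supplied point-to-solution cost; the weights of
    the centers in A themselves are deferred, as in the slides.
    """
    rng = np.random.default_rng(seed)
    m = np.array([cost(p, A) for p in P])   # m_p = cost(p, A)
    idx = rng.choice(len(P), size=t, p=m / m.sum())
    weights = m.sum() / (m[idx] * t)        # w_p = (Σ_q m_q) / (m_p |S|)
    return P[idx], weights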
Direct Adaptation in Distributed Setting

COMBINE

1 Compute a coreset for each local data set


2 Combine these local coresets to get a global coreset

Need to transmit n coresets

Can we do it with just one coreset?
Distributed Coreset Construction

Algorithm 1: Distributed coreset construction

1 Compute a constant approximation solution A_i for P_i
2 Broadcast the costs cost(P_i, A_i)
3 Let |S_i| / Σ_j |S_j| = cost(P_i, A_i) / Σ_j cost(P_j, A_j);
  sample S_i from P_i with probability proportional to cost(p, A_i)
4 Let the coreset D = ∪_i (S_i ∪ A_i) (with weights specified later)

Note: the size of each local sample is proportional to the local cost.
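A minimal end-to-end sketch of Algorithm 1 with the communication layer abstracted away; local_approx stands in for any constant-approximation subroutine (an assumption on our part), and the center weights are again deferred:

def distributed_coreset(local_sets, local_approx, cost, t, seed=0):
    """local_sets: list of numpy arrays P_1..P_n (one per node).
    local_approx: returns a constant approximation solution A_i for P_i.
    t: total number of points to sample across all nodes."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: each node computes A_i and broadcasts cost(P_i, A_i)
    A = [local_approx(P_i) for P_i in local_sets]
    local_costs = [sum(cost(p, A_i) for p in P_i)
                   for P_i, A_i in zip(local_sets, A)]
    total = sum(local_costs)
    coreset_pts, coreset_wts = [], []
    for P_i, A_i, c_i in zip(local_sets, A, local_costs):
        # Step 3: local sample size proportional to the local cost
        t_i = int(round(t * c_i / total))
        m = np.array([cost(p, A_i) for p in P_i])
        idx = rng.choice(len(P_i), size=t_i, p=m / m.sum())
        coreset_pts.append(P_i[idx])
        coreset_wts.append(total / (m[idx] * t))  # sampling-lemma weights
        # Step 4: the centers A_i also join the coreset (weights deferred)
    return coreset_pts, coreset_wts, A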
Coreset Construction Analysis
Intuition for Sampling

Sample a set S uniformly at random from P.

Let B(x, r) = {p : d(x, p) ≤ r}.

For a fixed B(x, r), w.h.p.
  | |B(x, r) ∩ S| / |S| - |B(x, r) ∩ P| / |P| | ≤ ε
when |S| = O(1/ε²)

[Figure: a single ball B(x, r)]
Coreset Construction Analysis
Intuition for Sampling

Sample a set S uniformly at random from P.

Let B(x, r) = {p : d(x, p) ≤ r}.

For any B(x, r), w.h.p.
  | |B(x, r) ∩ S| / |S| - |B(x, r) ∩ P| / |P| | ≤ ε
when |S| = O(log[#distinct B(x, r) ∩ P] / ε²)

[Figure: two balls B(x, r) and B(x′, r′)]
Coreset Construction Analysis

Algorithm 1: Distributed coreset construction

1 Compute a constant approximation solution A_i for P_i
2 Broadcast the costs cost(P_i, A_i)
3 Sample S_i from P_i with probability proportional to cost(p, A_i)
4 Let the coreset D = ∪_i (S_i ∪ A_i)

By a sampling lemma on general function spaces:

When |S| = O(kd), ∀x: Σ_{p∈P} cost(p, x) ≈ Σ_{p∈D} w_p cost(p, x)
Coreset Construction Analysis

Algorithm 1: Distributed coreset construction

1 Compute a constant approximation solution A_i for P_i
2 Broadcast the costs cost(P_i, A_i)
3 Sample S_i from P_i with probability proportional to cost(p, A_i)
4 Let the coreset D = ∪_i (S_i ∪ A_i)

Theorem (Distributed Coreset Construction)

Algorithm 1 produces an ε-coreset. The size of the coreset is
O(kd + nk) for constant ε.
Star Graph

Just send the local portions to the central coordinator.


Total communication cost: the size of the coreset


Our algorithm: O(kd + nk)
COMBINE: O(nkd)
General Communication Topologies

Message-Passing
Broadcast messages {I_j}_{j=1}^n, where I_j resides on node j
On each node i do:
1 Initialize R_i = {I_i} and send I_i to all neighbors.
2 If a received I_j ∉ R_i,
  then R_i ← R_i ∪ {I_j} and send I_j to all neighbors.

Total communication cost: O(m Σ_{j=1}^n |I_j|)
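A minimal sketch of this flooding protocol as a synchronous, single-process simulation (the round structure is for illustration only):

def flood_broadcast(adj, messages):
    """adj: adjacency list {node: [neighbors]}.
    messages: {node: its local message I_j}.
    Returns each node's received set R_i; every message crosses each
    edge O(1) times, so total communication is O(m * Σ_j |I_j|)."""
    received = {i: {i} for i in adj}          # R_i starts as {I_i}
    frontier = [(i, i) for i in adj]          # (holder, message id) pairs
    while frontier:
        next_frontier = []
        for holder, msg in frontier:
            for nbr in adj[holder]:
                if msg not in received[nbr]:  # forward only unseen messages
                    received[nbr].add(msg)
                    next_frontier.append((nbr, msg))
        frontier = next_frontier
    return {i: {messages[j] for j in received[i]} for i in adj}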
Distributed Clustering on General Topologies

Algorithm 2: Distributed Clustering

1 Call the distributed coreset construction algorithm
2 Broadcast the local coreset portions by Message-Passing
3 Compute an approximate solution on the coreset

Theorem (Distributed Clustering on General Graphs)

Given any α-approximation algorithm as a subroutine, Algorithm 2
computes a (1 + ε)α-approximation solution.

The total communication cost is O(m(kd + nk)) for constant ε.
Distributed Clustering on General Topologies


Our algorithm: O(m(kd + nk))
COMBINE: O(mnkd)
Experiment Setup

Data set: YearPredictionMSD (≈ 0.5M points in R^90)

Communication graphs: random, grid, preferential
Partition into 100 local data sets
Partition methods: uniform, weighted, similarity/degree-based
Evaluation criterion:
k-means cost (k = 50) at the same communication budget
Experiments for Distributed Clustering on Graphs

[Figure: k-means cost ratio vs. communication cost for Our Algo and
COMBINE across six settings: random graph (uniform, similarity-based,
weighted), grid graph (similarity-based, weighted), and preferential
graph (degree-based)]
Thanks!
Feldman, D. and Langberg, M. (2011).
A unified framework for approximating and clustering data.
In Proceedings of the Annual ACM Symposium on Theory of
Computing.
Har-Peled, S. and Mazumdar, S. (2004).
On coresets for k-means and k-median clustering.
In Proceedings of the Annual ACM Symposium on Theory of
Computing.
Coreset Construction Analysis
Sampling Lemma for General Functions

Let F be a set of functions from P to R_{≥0}.

For f ∈ F, let B(f, r) = {p : f(p) ≤ r}.
Special case: B(f_x, r) = B(x, r) when f_x(p) = d(x, p)
Sampling Lemma (weighted sampling, general functions)

Let m_p = max_{f∈F} f(p). Sample S from P with probability
proportional to m_p, and let w_p = (Σ_q m_q) / (m_p |S|).

If |S| = O(log[#distinct B(f, r) ∩ P] / ε²), then w.h.p.

∀f ∈ F:  | Σ_{p∈P} f(p) - Σ_{p∈S} w_p f(p) | ≤ ε Σ_{p∈P} m_p
Proof idea: replace p with m_p copies p′; let f(p′) = f(p)/m_p
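A small numeric illustration of the lemma's sampling scheme, using a toy finite family F of distance functions (all names here are ours, and numpy is assumed imported as np):

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 2))            # point set P in R^2

# Toy family F: f_x(p) = ||p - x|| for a few candidate centers x
centers = rng.normal(size=(5, 2))
F = [lambda p, x=x: np.linalg.norm(p - x, axis=-1) for x in centers]

m = np.max([f(P) for f in F], axis=0)       # m_p = max_{f in F} f(p)

t = 500
idx = rng.choice(len(P), size=t, p=m / m.sum())  # sample prop. to m_p
w = m.sum() / (m[idx] * t)                       # w_p = (Σ_q m_q)/(m_p |S|)

for f in F:
    exact = f(P).sum()
    estimate = (w * f(P[idx])).sum()
    # |exact - estimate| should be within ε * Σ_p m_p w.h.p.
    print(f"exact {exact:10.1f}  estimate {estimate:10.1f}")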


Coreset Construction Analysis
Sampling Lemma for k-median

Natural attempt: f_x(p) = cost(p, x)

Fails, since m_p = max_x f_x(p) is unbounded
Another attempt:
For p ∈ P_i, let b_p denote its nearest center in A_i.
Set f_x(p) = cost(p, x) - cost(b_p, x); then m_p = cost(p, A_i).

[Figure: points p, b_p, and a center x, showing cost(p, x),
cost(b_p, x), and the bound m_p]
Coreset Construction Analysis
Sampling Lemma for k-median

For p ∈ P_i, let b_p denote its nearest center in A_i.

Set f_x(p) = cost(p, x) - cost(b_p, x); then m_p = cost(p, b_p).

By the Sampling Lemma,

∀x:  | Σ_{p∈P} f_x(p) - Σ_{p∈S} w_p f_x(p) | ≤ ε Σ_{p∈P} m_p

LHS = | Σ_{p∈P} cost(p, x) - Σ_{p∈D} w_p cost(p, x) |
(with suitable weights on the centers b_p included in D)
RHS: ε Σ_{p∈P} m_p = ε Σ_i cost(P_i, A_i) = O(ε) · OPT
