
Distributed k-median and k-means Clustering

Yingyu Liang

Joint work with Maria-Florina Balcan and Steven Ehrlich


Georgia Institute of Technology

November 26, 2013


Outline

1 Introduction

2 Non-distributed Algorithms

3 Distributed Clustering
Clustering Comes Up Everywhere

Cluster news articles or web pages by topic

Cluster images by who is in them


Clustering Task

Partition objects into groups so that


those in the same group are similar
those in different groups are not similar

k-median/k-means Clustering

A set P of N objects, represented as points in R^d

Find centers x = {x_1, . . . , x_k} to minimize Σ_{p∈P} cost(p, x)
Each point p is assigned to its nearest center
k-median cost: distance to the nearest center
k-means cost: squared distance to the nearest center
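As a concrete reference, here is a minimal sketch of the two cost objectives in Python/numpy (the helper name clustering_cost is ours, not from the talk):

import numpy as np

def clustering_cost(P, centers, objective="means"):
    """Cost of assigning each point in P (N x d) to its nearest center.

    objective="median": sum of distances (k-median)
    objective="means":  sum of squared distances (k-means)
    """
    # Pairwise distances from every point to every center: N x k
    dists = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)          # distance to nearest center
    if objective == "median":
        return nearest.sum()
    return (nearest ** 2).sum()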
Outline

1 Introduction

2 Non-distributed Algorithms
Lloyd's Method
Local Search

3 Distributed Clustering
Lloyd's Method for k-means

Lloyd's Method for k-means

1 Initialize a set of centers {x_1, . . . , x_k}
2 Assign each point to its nearest center
3 Update each center to the mean of the points assigned to it
4 Repeat Assign-Update until converged
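A minimal runnable sketch of Lloyd's method, assuming numpy and random initialization from the input points (the slides leave the initialization strategy unspecified):

def lloyds(P, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct input points as centers (one common choice)
    centers = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to its nearest center
        dists = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update each center to the mean of its assigned points
        new_centers = np.array([
            P[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # 4. Repeat Assign-Update until converged
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels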
Lloyd's Method for k-means

[Figure slides: the Initialization, Assign, and Update steps illustrated on example data]
Local Search: Single Swap

Single Swap

1 Initialize a set of centers x = {x_1, . . . , x_k}
2 While there exist x_i and p ∈ P with cost(P, x - x_i + p) < cost(P, x):
  x ← x - x_i + p
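A minimal sketch of single-swap local search, reusing the clustering_cost helper above; note that this naive version accepts any improvement, whereas analyzed variants require each swap to improve the cost by a sufficiently large factor to bound the running time:

def single_swap(P, k, cost_fn, seed=0):
    rng = np.random.default_rng(seed)
    centers = list(rng.choice(len(P), size=k, replace=False))
    improved = True
    while improved:
        improved = False
        current = cost_fn(P, P[centers])
        # Try swapping each center x_i for each non-center point p
        for i in range(k):
            for p in range(len(P)):
                if p in centers:
                    continue
                trial = centers.copy()
                trial[i] = p
                if cost_fn(P, P[trial]) < current:
                    centers = trial     # accept the improving swap
                    improved = True
                    break
            if improved:
                break
    return P[centers]

Each accepted swap strictly decreases the cost, so the loop terminates.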
Single Swap

[Figure slides: initialization followed by two swap iterations illustrated on example data]
Outline

1 Introduction

2 Non-distributed Algorithms

3 Distributed Clustering
Global Summary Construction
Communication on General Topologies
Experiments
Modern Challenge: Distributed Data

Distributed databases
Images and videos on the Internet
Sensor networks
...
Distributed Clustering

Communication graph G with n nodes and m edges:


an edge indicates that the two nodes can communicate
Global data P is divided into local data sets P1 , . . . , Pn

Goal: efficient distributed algorithm for k-median/k-means


with guarantees for clustering cost and communication cost
Related Work

1 Direct adaptation of non-distributed algorithms,
e.g. Lloyd's method [Forman et al., 2000; Datta et al., 2005]
no consideration of the communication cost
2 Transmitting summaries of local data to a central coordinator
[Januzaj et al., 2003; Kargupta et al., 2001]
no guarantee on the clustering cost
not for general communication topologies
Our Results

A distributed algorithm for k-median/k-means that

1 produces a (1 + ε)α-approximation, using any α-approximation
non-distributed algorithm as a subroutine
2 has total communication cost O(kd + nk) on the star graph
3 has total communication cost on general topologies that is
independent of the number of points N
linear in the number of clusters k and the dimension d
linear in the number of nodes n and the number of edges m
Our Results

Two stages of our distributed algorithm

1 Construct a global summary of the data
each node constructs a local portion of the summary
2 Compute an approximate solution on the summary
each node broadcasts its local portion
Coreset

Weighted points whose cost approximates that of the original data

Coreset [Har-Peled and Mazumdar, 2004]

An ε-coreset for a set of points P with respect to a cost objective
function is a set of points D and a set of weights w : D → R such
that for any set of centers x,

(1 - ε) cost(P, x) ≤ Σ_{p∈D} w_p cost(p, x) ≤ (1 + ε) cost(P, x).
Coreset Construction in the Non-distributed Setting

Coreset construction [Feldman and Langberg, 2011]

1 Compute a constant approximation solution A
2 Sample points S with probability proportional to cost(p, A)
3 Let the coreset D = S ∪ A (with weights specified later)
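A sketch of the sampling step in Python/numpy; the weights shown follow the inverse-probability scheme from the sampling lemma later in the talk, while the full Feldman-Langberg weighting of the centers in A is a detail the slides defer:

def sample_coreset_points(P, A, t, cost, seed=0):
    """Sample t points from P proportional to their cost under solution A.

    cost(p, A) is a user-supplied point-to-solution cost; the weights of
    the centers in A themselves are deferred, as in the slides.
    """
    rng = np.random.default_rng(seed)
    m = np.array([cost(p, A) for p in P])   # m_p = cost(p, A)
    idx = rng.choice(len(P), size=t, p=m / m.sum())
    weights = m.sum() / (m[idx] * t)        # w_p = (Σ_q m_q) / (m_p |S|)
    return P[idx], weights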
Direct Adaptation in Distributed Setting

COMBINE

1 Compute a coreset for each local data set


2 Combine these local coresets to get a global coreset

Need to transmit n coresets

Can we do it with just one coreset?
Distributed Coreset Construction

Algorithm 1: Distributed coreset construction

1 Compute a constant approximation solution A_i for P_i
2 Broadcast the costs cost(P_i, A_i)
3 Let |S_i| / Σ_j |S_j| = cost(P_i, A_i) / Σ_j cost(P_j, A_j);
  sample S_i from P_i with probability proportional to cost(p, A_i)
4 Let the coreset D = ∪_i (S_i ∪ A_i) (with weights specified later)

Note: the size of each local sample is proportional to the local cost.
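A minimal end-to-end sketch of Algorithm 1 with the communication layer abstracted away; local_approx stands in for any constant-approximation subroutine (an assumption on our part), and the center weights are again deferred:

def distributed_coreset(local_sets, local_approx, cost, t, seed=0):
    """local_sets: list of numpy arrays P_1..P_n (one per node).
    local_approx: returns a constant approximation solution A_i for P_i.
    t: total number of points to sample across all nodes."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: each node computes A_i and broadcasts cost(P_i, A_i)
    A = [local_approx(P_i) for P_i in local_sets]
    local_costs = [sum(cost(p, A_i) for p in P_i)
                   for P_i, A_i in zip(local_sets, A)]
    total = sum(local_costs)
    coreset_pts, coreset_wts = [], []
    for P_i, A_i, c_i in zip(local_sets, A, local_costs):
        # Step 3: local sample size proportional to the local cost
        t_i = int(round(t * c_i / total))
        m = np.array([cost(p, A_i) for p in P_i])
        idx = rng.choice(len(P_i), size=t_i, p=m / m.sum())
        coreset_pts.append(P_i[idx])
        coreset_wts.append(total / (m[idx] * t))  # sampling-lemma weights
        # Step 4: the centers A_i also join the coreset (weights deferred)
    return coreset_pts, coreset_wts, A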
Coreset Construction Analysis
Intuition for Sampling

Sample a set S uniformly at random from P.

Let B(x, r) = {p : d(x, p) ≤ r}.

For a fixed B(x, r), w.h.p.
  | |B(x, r) ∩ S| / |S| - |B(x, r) ∩ P| / |P| | ≤ ε
when |S| = O(1/ε²)

[Figure: a single ball B(x, r)]
Coreset Construction Analysis
Intuition for Sampling

Sample a set S uniformly at random from P.

Let B(x, r) = {p : d(x, p) ≤ r}.

For any B(x, r), w.h.p.
  | |B(x, r) ∩ S| / |S| - |B(x, r) ∩ P| / |P| | ≤ ε
when |S| = O(log[#distinct B(x, r) ∩ P] / ε²)

[Figure: two balls B(x, r) and B(x′, r′)]
Coreset Construction Analysis

Algorithm 1: Distributed coreset construction

1 Compute a constant approximation solution A_i for P_i
2 Broadcast the costs cost(P_i, A_i)
3 Sample S_i from P_i with probability proportional to cost(p, A_i)
4 Let the coreset D = ∪_i (S_i ∪ A_i)

By a sampling lemma on general function spaces:

When |S| = O(kd), ∀x: Σ_{p∈P} cost(p, x) ≈ Σ_{p∈D} w_p cost(p, x)
Coreset Construction Analysis

Algorithm 1: Distributed coreset construction

1 Compute a constant approximation solution A_i for P_i
2 Broadcast the costs cost(P_i, A_i)
3 Sample S_i from P_i with probability proportional to cost(p, A_i)
4 Let the coreset D = ∪_i (S_i ∪ A_i)

Theorem (Distributed Coreset Construction)

Algorithm 1 produces an ε-coreset. The size of the coreset is
O(kd + nk) for constant ε.
Star Graph

Just send the local portions to the central coordinator.


Total communication cost: the size of the coreset


Our algorithm: O(kd + nk)
COMBINE: O(nkd)
General Communication Topologies

Message-Passing
Broadcast messages {I_j}_{j=1}^n, where I_j resides on node j
On each node i do:
1 Initialize R_i = {I_i} and send I_i to all neighbors.
2 If a received I_j ∉ R_i,
  then R_i ← R_i ∪ {I_j} and send I_j to all neighbors.

Total communication cost: O(m Σ_{j=1}^n |I_j|)
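A minimal sketch of this flooding protocol as a synchronous, single-process simulation (the round structure is for illustration only):

def flood_broadcast(adj, messages):
    """adj: adjacency list {node: [neighbors]}.
    messages: {node: its local message I_j}.
    Returns each node's received set R_i; every message crosses each
    edge O(1) times, so total communication is O(m * Σ_j |I_j|)."""
    received = {i: {i} for i in adj}          # R_i starts as {I_i}
    frontier = [(i, i) for i in adj]          # (holder, message id) pairs
    while frontier:
        next_frontier = []
        for holder, msg in frontier:
            for nbr in adj[holder]:
                if msg not in received[nbr]:  # forward only unseen messages
                    received[nbr].add(msg)
                    next_frontier.append((nbr, msg))
        frontier = next_frontier
    return {i: {messages[j] for j in received[i]} for i in adj}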
Distributed Clustering on General Topologies

Algorithm 2: Distributed Clustering

1 Call the distributed coreset construction algorithm
2 Broadcast the local coreset portions by Message-Passing
3 Compute an approximate solution on the coreset

Theorem (Distributed Clustering on General Graphs)

Given any α-approximation algorithm as a subroutine, Algorithm 2
computes a (1 + ε)α-approximation solution.

The total communication cost is O(m(kd + nk)) for constant ε.
Distributed Clustering on General Topologies


Our algorithm: O(m(kd + nk))
COMBINE: O(mnkd)
Experiment Setup

Data set: YearPredictionMSD (≈ 0.5M points in R^90)

Communication graphs: random, grid, preferential
Partition into 100 local data sets
Partition methods: uniform, weighted, similarity/degree-based
Evaluation criterion:
k-means cost (k = 50) at the same communication budget
Experiments for Distributed Clustering on Graphs

[Figure: k-means cost ratio vs. communication cost for Our Algo and
COMBINE across six settings: random graph (uniform, similarity-based,
weighted), grid graph (similarity-based, weighted), and preferential
graph (degree-based)]
Thanks!
Feldman, D. and Langberg, M. (2011).
A unified framework for approximating and clustering data.
In Proceedings of the Annual ACM Symposium on Theory of
Computing.
Har-Peled, S. and Mazumdar, S. (2004).
On coresets for k-means and k-median clustering.
In Proceedings of the Annual ACM Symposium on Theory of
Computing.
Coreset Construction Analysis
Sampling Lemma for General Functions

Let F be a set of functions from P to R_{≥0}.

For f ∈ F, let B(f, r) = {p : f(p) ≤ r}.
Special case: B(f_x, r) = B(x, r) when f_x(p) = d(x, p)
Sampling Lemma (weighted sampling, general functions)

Let m_p = max_{f∈F} f(p). Sample S from P with probability
proportional to m_p, and let w_p = (Σ_q m_q) / (m_p |S|).

If |S| = O(log[#distinct B(f, r) ∩ P] / ε²), then w.h.p.

∀f ∈ F:  | Σ_{p∈P} f(p) - Σ_{p∈S} w_p f(p) | ≤ ε Σ_{p∈P} m_p
Proof idea: replace p with m_p copies p′; let f(p′) = f(p)/m_p
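A small numeric illustration of the lemma's sampling scheme, using a toy finite family F of distance functions (all names here are ours, and numpy is assumed imported as np):

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 2))            # point set P in R^2

# Toy family F: f_x(p) = ||p - x|| for a few candidate centers x
centers = rng.normal(size=(5, 2))
F = [lambda p, x=x: np.linalg.norm(p - x, axis=-1) for x in centers]

m = np.max([f(P) for f in F], axis=0)       # m_p = max_{f in F} f(p)

t = 500
idx = rng.choice(len(P), size=t, p=m / m.sum())  # sample prop. to m_p
w = m.sum() / (m[idx] * t)                       # w_p = (Σ_q m_q)/(m_p |S|)

for f in F:
    exact = f(P).sum()
    estimate = (w * f(P[idx])).sum()
    # |exact - estimate| should be within ε * Σ_p m_p w.h.p.
    print(f"exact {exact:10.1f}  estimate {estimate:10.1f}")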


Coreset Construction Analysis
Sampling Lemma for k-median

Natural attempt: f_x(p) = cost(p, x)

Fails, since m_p = max_x f_x(p) is unbounded
Another attempt:
For p ∈ P_i, let b_p denote its nearest center in A_i.
Set f_x(p) = cost(p, x) - cost(b_p, x); then m_p = cost(p, A_i).

[Figure: points p, b_p, and a center x, showing cost(p, x),
cost(b_p, x), and the bound m_p]
Coreset Construction Analysis
Sampling Lemma for k-median

For p ∈ P_i, let b_p denote its nearest center in A_i.

Set f_x(p) = cost(p, x) - cost(b_p, x); then m_p = cost(p, b_p).

By the Sampling Lemma,

∀x:  | Σ_{p∈P} f_x(p) - Σ_{p∈S} w_p f_x(p) | ≤ ε Σ_{p∈P} m_p

LHS = | Σ_{p∈P} cost(p, x) - Σ_{p∈D} w_p cost(p, x) |
(with suitable weights on the centers b_p included in D)
RHS: ε Σ_{p∈P} m_p = ε Σ_i cost(P_i, A_i) = O(ε) · OPT
