Yingyu Liang
1 Introduction
2 Non-distributed Algorithms
3 Distributed Clustering
Clustering Comes Up Everywhere
[Figure: example clusters labeled "sports" and "fashion"]
k-median/k-means Clustering
[Figure: point set P with clusters labeled "sports" and "fashion"]
Find centers $x = \{x_1, \dots, x_k\}$ to minimize $\sum_{p \in P} \mathrm{cost}(p, x)$
Each point p is assigned to its nearest center
k-median cost: distance to the nearest center
k-means cost: squared distance to the nearest center
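The two objectives differ only in whether the distance to the nearest center is squared. A minimal sketch, assuming Euclidean distance and NumPy arrays; the function name `clustering_cost` and the toy data are illustrative and not from the slides:

```python
import numpy as np

def clustering_cost(points, centers, squared=False):
    """Sum over all points of the (squared) distance to the nearest center."""
    # Pairwise Euclidean distances, shape (num_points, num_centers).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)  # each point is assigned to its nearest center
    return float(np.sum(nearest ** 2 if squared else nearest))

# Toy data (hypothetical): two points near each of two centers.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
centers = np.array([[0.0, 1.0], [10.0, 2.0]])

kmedian = clustering_cost(points, centers)                # 1 + 1 + 2 + 2
kmeans = clustering_cost(points, centers, squared=True)   # 1 + 1 + 4 + 4
```

Squaring makes the k-means objective more sensitive to far-away points than the k-median objective.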
Outline
1 Introduction
2 Non-distributed Algorithms
Lloyd's Method
Local Search
3 Distributed Clustering
Lloyd's Method for k-means
Initialization: choose k initial centers
Assign: assign each point to its nearest center
Update: move each center to the mean of its assigned points
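The Initialization, Assign, and Update steps can be sketched as follows. This is an illustrative version rather than the speaker's code: the names `lloyd` and `nearest_center`, the random initialization from input points, and the stopping rule are all assumptions.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the nearest center for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def lloyd(points, k, init_centers=None, max_iters=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: use the given centers, else k distinct random input points.
    if init_centers is None:
        centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    else:
        centers = np.array(init_centers, dtype=float)
    for _ in range(max_iters):
        # Assign: each point goes to its nearest center.
        labels = nearest_center(points, centers)
        # Update: each center moves to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, nearest_center(points, centers)

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
centers, labels = lloyd(points, k=2, init_centers=[[0.0, 0.0], [10.0, 0.0]])
```

The result depends on the initialization. On this toy data, starting from the two left-hand points converges to centers (5, 0) and (5, 1), a much worse local optimum.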
Local Search: Single Swap
Initialization
Iteration 1
Iteration 2
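The slides show single swap only as pictures, so the following is a sketch under stated assumptions: the objective is k-median, candidate centers are restricted to input points, and a swap is accepted whenever it lowers the cost by more than a small threshold. The names `kmedian_cost`, `single_swap`, and `min_gain` are illustrative.

```python
import numpy as np

def kmedian_cost(points, center_idx):
    """k-median cost when the centers are the input points at center_idx."""
    centers = points[center_idx]
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float(dists.min(axis=1).sum())

def single_swap(points, k, init_idx=None, min_gain=1e-9):
    n = len(points)
    # Initialization: an arbitrary set of k input points (assumption: the first k).
    centers = list(range(k)) if init_idx is None else list(init_idx)
    cost = kmedian_cost(points, centers)
    improved = True
    while improved:
        improved = False
        for pos in range(k):              # center to swap out
            for cand in range(n):         # input point to swap in
                if cand in centers:
                    continue
                trial = centers.copy()
                trial[pos] = cand
                trial_cost = kmedian_cost(points, trial)
                if trial_cost < cost - min_gain:
                    centers, cost, improved = trial, trial_cost, True
                    break                 # rescan from the new solution
            if improved:
                break
    return centers, cost

points = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
                   [10.0, 0.0], [11.0, 0.0], [12.0, 0.0]])
swap_centers, swap_cost = single_swap(points, k=2, init_idx=[0, 1])
```

Each accepted swap strictly lowers the cost, so the loop terminates at a solution that no single swap can improve.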
Outline
1 Introduction
2 Non-distributed Algorithms
3 Distributed Clustering
Global Summary Construction
Communication on General Topologies
Experiments
Modern Challenge: Distributed Data
Distributed databases
Images and videos on the Internet
Sensor networks
...
Distributed Clustering
COMBINE
1 Compute a constant approximation solution $A_i$ for $P_i$
2 Broadcast the costs $\mathrm{cost}(P_i, A_i)$
3 Let $\frac{|S_i|}{\sum_j |S_j|} = \frac{\mathrm{cost}(P_i, A_i)}{\sum_j \mathrm{cost}(P_j, A_j)}$;
  Sample $S_i$ from $P_i$ with probability proportional to $\mathrm{cost}(p, A_i)$
  (the size of the local sample is proportional to the local cost)
4 Let the coreset $D = \bigcup_i (S_i \cup A_i)$ (with weights specified later)
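Steps 1 to 4 can be simulated on a single machine, with a list of arrays standing in for the nodes. This is a sketch under several assumptions: the local solutions $A_i$ are supplied by the caller, the cost is the k-median cost, sample sizes are rounded to integers, and sampling is with replacement. The coreset weights are left out because the slide defers them. The names `point_costs`, `distributed_sample`, and `total_samples` are illustrative.

```python
import numpy as np

def point_costs(points, centers):
    """k-median cost of each point: distance to its nearest center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1)

def distributed_sample(local_sets, local_solutions, total_samples, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: node i evaluates cost(p, A_i) and "broadcasts" cost(P_i, A_i).
    per_point = [point_costs(P, A) for P, A in zip(local_sets, local_solutions)]
    local_costs = np.array([m.sum() for m in per_point])
    # Step 3: |S_i| proportional to the local cost (assumes total cost > 0).
    sizes = np.round(total_samples * local_costs / local_costs.sum()).astype(int)
    coreset = []
    for P, A, m, t in zip(local_sets, local_solutions, per_point, sizes):
        if t > 0:
            # Sample points with probability proportional to cost(p, A_i).
            idx = rng.choice(len(P), size=int(t), replace=True, p=m / m.sum())
        else:
            idx = np.array([], dtype=int)
        # Step 4: node i contributes S_i together with A_i (weights omitted).
        coreset.append((P[idx], A))
    return sizes, coreset

# Two hypothetical nodes with local costs 2 and 6.
P0, A0 = np.array([[1.0, 0.0], [0.0, 0.0], [2.0, 0.0]]), np.array([[1.0, 0.0]])
P1, A1 = np.array([[0.0, 0.0], [6.0, 0.0]]), np.array([[3.0, 0.0]])
sizes, coreset = distributed_sample([P0, P1], [A0, A1], total_samples=8)
```

Only one number per node is exchanged before sampling, and nodes with a higher local cost contribute more sample points.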
Distributed Coreset Construction
For fixed $B(x, r)$, w.h.p. $\frac{|B(x,r) \cap S|}{|S|} = \frac{|B(x,r) \cap P|}{|P|} \pm \epsilon$
when $|S| = O(1/\epsilon^2)$
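The statement can be checked empirically. The sketch below assumes S is a uniform random sample of P, and uses hypothetical synthetic data and a hypothetical ball; it compares the fraction of points inside the ball in P and in S.

```python
import numpy as np

def ball_fraction(points, center, radius):
    """Fraction of the points that lie inside the ball B(center, radius)."""
    inside = np.linalg.norm(points - center, axis=1) <= radius
    return float(np.mean(inside))

rng = np.random.default_rng(0)
P = rng.random((10000, 2))                           # hypothetical data set
S = P[rng.choice(len(P), size=4000, replace=True)]   # uniform sample of P

center, radius = np.array([0.5, 0.5]), 0.3           # a fixed ball B(x, r)
frac_P = ball_fraction(P, center, radius)
frac_S = ball_fraction(S, center, radius)
gap = abs(frac_P - frac_S)                           # plays the role of epsilon
```

With 4000 samples the standard deviation of `frac_S` is below 0.01, so the gap is typically far smaller than 0.05.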
[Figure: ball B(x, r)]
Coreset Construction Analysis
Intuition for Sampling
[Figure: balls B(x', r') and B(x, r)]
Coreset Construction Analysis
Our algorithm: O(kd + nk)
COMBINE: O(nkd)
General Communication Topologies
Message-Passing
Broadcast messages $\{I_j\}_{j=1}^{n}$, where $I_j$ is on node $j$
On each node i do:
1 Initialize $R_i = \{I_i\}$ and send $I_i$ to all neighbors.
2 If receive $I_j \notin R_i$,
  then $R_i = R_i \cup \{I_j\}$ and send $I_j$ to all neighbors.
Total communication cost: $O(m \sum_{j=1}^{n} |I_j|)$
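The flooding protocol above can be simulated with a queue of in-flight messages. This is a sketch under assumptions: the graph is an adjacency dictionary, $|I_j|$ is measured as `len(messages[j])`, and the names `flood`, `neighbors`, and `messages` are illustrative.

```python
from collections import deque

def flood(neighbors, messages):
    """neighbors: node -> list of adjacent nodes; messages: node j -> I_j."""
    received = {i: {i} for i in neighbors}   # R_i, stored as message ids
    queue = deque()                          # in-flight (destination, id) pairs
    cost = 0                                 # total size of all messages sent
    # Step 1: every node sends its own message to all neighbors.
    for i in neighbors:
        for nb in neighbors[i]:
            queue.append((nb, i))
            cost += len(messages[i])
    # Step 2: a node forwards a message only the first time it receives it.
    while queue:
        node, j = queue.popleft()
        if j not in received[node]:
            received[node].add(j)
            for nb in neighbors[node]:
                queue.append((nb, j))
                cost += len(messages[j])
    return received, cost

# Path graph 0 - 1 - 2 (2 edges) with message sizes 1, 2, and 3.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
messages = {0: "a", 1: "bb", 2: "ccc"}
received, cost = flood(neighbors, messages)
```

Every node sends each message to all of its neighbors exactly once, so in this simulation each message crosses each edge twice. The total is 2 · 2 · (1 + 2 + 3) = 24, consistent with the $O(m \sum_j |I_j|)$ bound.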
Distributed Clustering on General Topologies
Our algorithm: O(m(kd + nk))
COMBINE: O(mnkd)
Experiment Setup
[Figure: three panels comparing Our Algo and COMBINE; y-axis: k-means cost ratio (about 1.04 to 1.2); x-axis: values on the order of 10^7]
Coreset Construction Analysis
Sampling Lemma for General Functions
[Figure: points p, b_p, and x, with cost(p, x), cost(b_p, x), and m_p labeled]
Coreset Construction Analysis
Sampling Lemma for k-median
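By Sampling Lemma,
$$\forall x, \quad \Big| \sum_{p \in P} f_x(p) - \sum_{p \in S} w_p f_x(p) \Big| \le \epsilon \sum_{p \in P} m_p .$$
Left-hand side $= \big| \sum_{p \in P} \mathrm{cost}(p, x) - \sum_{p \in D} w_p \, \mathrm{cost}(p, x) \big|$
Right-hand side $= \epsilon \sum_i \mathrm{cost}(P_i, A_i) = O(\epsilon)\,\mathrm{OPT}$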