September 9, 2011
The Minimum Code Length for Clustering Using the Gray Code
Mahito SUGIYAMA, Akihiro YAMAMOTO
Kyoto University
Results (Real datasets)
[Figure: original image, binary filtering, G-COOL, and K-means panels for the datasets Delta, Dragon, Europe, Norway, and Ganges]
Outline
0. Overview
1. Background and Our Strategy
2. MCL and Clustering
3. COOL Algorithm
4. G-COOL: COOL with the Gray Code
5. Experiments
6. Conclusion
Clustering Focusing on Compression
The MDL approach [Kontkanen et al., 2005]
  Data encoding has to be optimized
  All encoding schemes are (implicitly) considered
  The time complexity: O(n^2)
The Kolmogorov complexity approach [Cilibrasi, 2005]
  Measures the distance between data points based on compression of finite sequences
  Difficult to apply to multivariate data
  The actual clustering process is the traditional agglomerative hierarchical clustering
  The time complexity: O(n^2)
Both approaches are not suitable for massive data
Our Strategy
Requirements:
1. Fast, and linear in the data size
2. Robust to changes in input parameters
3. Can find arbitrarily shaped clusters
Solutions:
1. Fix an encoding scheme for continuous variables
   Motivated by Computable Analysis [Weihrauch, 2000]
2. Clustering = Discretizing real-valued data
   Always finds the best results w.r.t. the MCL
3. Use the Gray code for real numbers [Tsuiki, 2002]
   Discretized data points are overlapped and adjacent clusters are merged
MCL (Minimum Code Length)
The MCL is the code length of the maximally compressed clusters by using a fixed encoding scheme
The MCL is calculated in O(nd) by using radix sort
  n and d are the number of data and dimension, resp.
Example: $X = \{0.1, 0.2, 0.8, 0.9\}$,
  $\mathcal{C}_1 = \{\{0.1, 0.2\}, \{0.8, 0.9\}\}$
  $\mathcal{C}_2 = \{\{0.1\}, \{0.2, 0.8\}, \{0.9\}\}$
  Use binary encoding
  Which is preferred?
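Binary Encoding
[Figure: binary encoding of the interval [0, 1]; horizontal axis from 0 to 1 with midpoint 0.5, vertical axis "Position" from 0 to 4]
  0.1 → 00011...
  0.2 → 00110...
  0.8 → 11001...
  0.9 → 11100...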
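The binary expansions on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `binary_prefix` is hypothetical, and it assumes the convention suggested by the deck's own example "0.25 → 00111...", namely that a dyadic rational takes its non-terminating expansion.

```python
from fractions import Fraction


def binary_prefix(x: Fraction, k: int) -> str:
    """Return the first k bits of the binary expansion of x in [0, 1].

    Assumption: a bit is 1 only when the doubled value strictly exceeds 1,
    so 0.25 is expanded as 00111... rather than 01000...
    """
    bits = []
    for _ in range(k):
        x *= 2
        if x > 1:
            bits.append("1")
            x -= 1
        else:
            bits.append("0")
    return "".join(bits)


# Exact decimals are passed as strings to avoid floating-point rounding.
for value in ("0.1", "0.2", "0.8", "0.9"):
    print(value, binary_prefix(Fraction(value), 5))
```

Using `Fraction` keeps the arithmetic exact, which matters only for values that sit on an interval boundary.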
MCL with Binary Encoding
[Figure: points A, B, C, D on the interval [0, 1] with ticks at 0, 0.25, 0.5, 0.75, 1, and the binary intervals of Lv. 1 (0, 1), Lv. 2 (00, 01, 10, 11), and Lv. 3 (000, 001, 110, 111)]
  id  value
  A   0.1
  B   0.2
  C   0.8
  D   0.9
Lv. 1 prefixes 0 and 1 for $\mathcal{C}_1 = \{\{A, B\}, \{C, D\}\}$: MCL = 1 + 1 = 2
Lv. 3 prefixes 000, 001, 110, 111 for $\mathcal{C}_2 = \{\{A\}, \{B, C\}, \{D\}\}$: MCL = 3 × 4 = 12
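The two MCL values in this example can be checked mechanically. The sketch below is an illustration under stated assumptions, not the O(nd) radix-sort procedure mentioned in the deck: it handles one-dimensional data with the binary encoding only, the names `binary_prefix` and `mcl_binary` are hypothetical, and it simply takes, for every point, the shortest prefix that no point of another cluster shares.

```python
from fractions import Fraction


def binary_prefix(x: Fraction, k: int) -> str:
    """First k bits of the binary expansion of x in [0, 1] (0.25 -> 00111...)."""
    bits = []
    for _ in range(k):
        x *= 2
        if x > 1:
            bits.append("1")
            x -= 1
        else:
            bits.append("0")
    return "".join(bits)


def mcl_binary(partition, max_depth=32):
    """Total length of the shortest prefixes that separate the clusters."""
    total = 0
    for i, cluster in enumerate(partition):
        others = [y for j, c in enumerate(partition) if j != i for y in c]
        cover = set()
        for x in cluster:
            for k in range(max_depth + 1):
                w = binary_prefix(x, k)
                # w is usable if no point outside the cluster starts with it
                if all(binary_prefix(y, k) != w for y in others):
                    cover.add(w)
                    break
            else:
                raise ValueError("points are not separable within max_depth")
        total += sum(len(w) for w in cover)
    return total


A, B, C, D = (Fraction(v) for v in ("0.1", "0.2", "0.8", "0.9"))
print(mcl_binary([[A, B], [C, D]]))    # partition C_1
print(mcl_binary([[A], [B, C], [D]]))  # partition C_2
```

Under these assumptions the single-cluster partition gets code length 0, because the empty prefix already covers everything.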
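Definition of MCL
Fix an embedding $\varphi : \mathbb{R}^d \to \Sigma^\omega$ ($\Sigma = \{0, 1\}$ usually)
For $p \in \mathrm{range}(\varphi)$ and $P \subset \mathrm{range}(\varphi)$, define
  $\Phi(p \mid P) := \{\, w \in \Sigma^* \mid p \in {\uparrow}w \text{ and } P \cap {\uparrow}w = \emptyset \,\}$
  $\mathrm{MCL}(\mathcal{C}) := \sum_{i} \min \{\, |W| \mid \varphi(C_i) \subseteq \bigcup_{w \in W} {\uparrow}w,\ W \subseteq \bigcup_{x \in C_i} \Phi(\varphi(x) \mid \varphi(X \setminus C_i)) \,\}$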
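Minimizing MCL and Clustering
Clustering under the MCL criterion is to find the global optimal solution that minimizes the MCL
  Find $\mathcal{C}_{op}$ such that
  $\mathcal{C}_{op} \in \operatorname{argmin}_{\mathcal{C} \in \mathcal{C}(X)_{\ge K}} \mathrm{MCL}(\mathcal{C})$,
  where $\mathcal{C}(X)_{\ge K} = \{\, \mathcal{C} \text{ is a partition of } X \mid \#\mathcal{C} \ge K \,\}$
We give the lower bound of the number of clusters K as an input parameter
  $\mathcal{C}_{op}$ becomes one set $\{X\}$ without this assumption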
Optimization by COOL
COOL solves the optimization problem in O(nd)
  n and d are the number of data and dimension, resp.
  The naive approach takes exponential time and space
The computing process of the MCL becomes the clustering process itself via discretization
COOL is level-wise, and makes the level-k partition $\mathcal{C}^k$ from k = 1, 2, ..., which satisfies the following condition:
  For all $x, y \in X$, they are in the same cluster $\iff$ $v = w$ for some $v \sqsubset \varphi(x)$ and $w \sqsubset \varphi(y)$ with $|v| = |w| = k$
Level-k partitions form a hierarchy
  For all $C \in \mathcal{C}^k$, there exists $\mathcal{D} \subseteq \mathcal{C}^{k+1}$ such that $\bigcup \mathcal{D} = C$
For all $C \in \mathcal{C}_{op}$, there exists k such that $C \in \mathcal{C}^k$
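For the binary encoding, the level-k partition amounts to grouping points by the cell of a regular grid with 2^k cells per axis. The following sketch illustrates that reading; it is an assumption-laden illustration rather than the authors' code. The name `level_k_partition` is hypothetical, a hash map stands in for the radix sort mentioned in the deck, cells are taken as (a, b] to match the "0.25 → 00111..." convention, and the value 0.11 for point A is a stand-in.

```python
import math
from collections import defaultdict


def level_k_partition(points, k):
    """Group points of [0, 1]^d by their k-bit binary prefix per coordinate."""
    cells = defaultdict(list)
    for p in points:
        # Index of the dyadic cell (a, b] that contains each coordinate.
        key = tuple(max(math.ceil(c * 2**k) - 1, 0) for c in p)
        cells[key].append(p)
    return list(cells.values())


# One-dimensional example in the spirit of the slides (A's value is a stand-in).
points = [(0.11,), (0.398,), (0.526,), (0.701,), (0.796,)]
for k in (1, 2, 3):
    print(k, level_k_partition(points, k))
```

Because every level-(k+1) cell lies inside exactly one level-k cell, the partitions produced this way are nested, which is the hierarchy property stated on the slide.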
COOL with Binary Encoding
[Figure: points A to E on the interval [0, 1] with ticks at 0, 0.25, 0.5, 0.75, 1, and the binary intervals of Lv. 1 (0, 1), Lv. 2 (00, 01, 10, 11), and Lv. 3 (100, 101)]
  id  value    Lv. 1  Lv. 2  Lv. 3
  A   0.11...  0      00     00
  B   0.398    0      01     01
  C   0.526    1      10     100
  D   0.701    1      10     101
  E   0.796    1      11     11
Lv. 1: MCL = 1 + 1 = 2
Lv. 2: MCL = 2 × 4 = 8
Lv. 3: MCL = 6 + 6 = 12
Noise Filtering by COOL
Noise filtering is easily implemented in COOL
Define $\mathcal{C}_{\ge N} := \{\, C \in \mathcal{C} \mid \#C \ge N \,\}$ for a partition $\mathcal{C}$
  See a cluster C as noise if #C < N
  Example: Given $\mathcal{C} = \{\{0.1\}, \{0.4, 0.5, 0.6\}, \{0.9\}\}$, $\mathcal{C}_{\ge 2} = \{\{0.4, 0.5, 0.6\}\}$, and 0.1 and 0.9 are noise
We input the lower bound N of the cluster size as an input parameter
[Figure: noise filtering on the interval [0, 1] with N = 2]
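The filtering rule is a one-line size check. A minimal sketch, with the hypothetical name `filter_noise`, applied to the example from this slide:

```python
def filter_noise(partition, n_min):
    """Split a partition into clusters of size >= n_min and leftover noise points."""
    kept = [cluster for cluster in partition if len(cluster) >= n_min]
    noise = [x for cluster in partition if len(cluster) < n_min for x in cluster]
    return kept, noise


clusters, noise = filter_noise([[0.1], [0.4, 0.5, 0.6], [0.9]], n_min=2)
print(clusters)
print(noise)
```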
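Algorithm of COOL
Input: A data set X, two lower bounds K and N
Output: The optimal partition $\mathcal{C}_{op}$ and noises
function COOL(X, K, N)
1: Find partitions $\mathcal{C}^1_{\ge N}, \dots, \mathcal{C}^m_{\ge N}$ such that $\#\mathcal{C}^{m-1}_{\ge N} < K \le \#\mathcal{C}^m_{\ge N}$
2: $(\mathcal{C}_{op}, \mathrm{MCL}(\mathcal{C}_{op})) \leftarrow$ FINDCLUSTERS(X, K, $\{\mathcal{C}^1_{\ge N}, \dots, \mathcal{C}^m_{\ge N}\}$)
3: return $(\mathcal{C}_{op}, X \setminus \bigcup \mathcal{C}_{op})$
function FINDCLUSTERS(X, K, $\{\mathcal{C}^1, \dots, \mathcal{C}^m\}$)
1: Find k such that $\#\mathcal{C}^{k-1} < K$ and $\#\mathcal{C}^k \ge K$
2: $\mathcal{C}_{op} \leftarrow \mathcal{C}^k$
3: if K = 2 then return $(\mathcal{C}_{op}, \mathrm{MCL}(\mathcal{C}_{op}))$
4: for each C in $\mathcal{C}^1 \cup \dots \cup \mathcal{C}^{k-1}$
5:   $(\mathcal{C}, L) \leftarrow$ FINDCLUSTERS($X \setminus C$, K − 1, $\{\mathcal{C}^1, \dots, \mathcal{C}^k\}$)
6:   if $\mathrm{MCL}(\mathcal{C} \cup \{C\}) < \mathrm{MCL}(\mathcal{C}_{op})$ then $\mathcal{C}_{op} \leftarrow \mathcal{C} \cup \{C\}$
7: return $(\mathcal{C}_{op}, \mathrm{MCL}(\mathcal{C}_{op}))$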
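Step 1 of COOL together with the initial candidate of FINDCLUSTERS (steps 1 and 2) can be sketched as a search over levels. The code below covers only that part and omits the recursive refinement of steps 4 to 6, so it is not a full implementation of the algorithm. It assumes the binary encoding with cells (a, b], and the names `level_partition` and `first_level_with_k_clusters` are hypothetical.

```python
import math
from collections import defaultdict


def level_partition(points, k):
    """Level-k partition of points in [0, 1]^d under the binary encoding."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(max(math.ceil(c * 2**k) - 1, 0) for c in p)
        cells[key].append(p)
    return list(cells.values())


def first_level_with_k_clusters(points, K, N, max_level=32):
    """Smallest level whose noise-filtered partition has at least K clusters."""
    for k in range(1, max_level + 1):
        partition = level_partition(points, k)
        kept = [c for c in partition if len(c) >= N]
        if len(kept) >= K:
            noise = [p for c in partition if len(c) < N for p in c]
            return k, kept, noise
    raise ValueError("no level up to max_level has K clusters of size >= N")


points = [(0.11,), (0.398,), (0.526,), (0.701,), (0.796,)]
print(first_level_with_k_clusters(points, K=2, N=1))
print(first_level_with_k_clusters(points, K=3, N=1))
```

The returned partition is only the starting candidate; the recursion in the pseudocode may still replace it with a partition that mixes clusters from coarser levels and has a smaller MCL.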
Gray Code
Real numbers in [0, 1] are encoded with 0, 1, and $\bot$
  Binary: 0.1 → 00011..., 0.25 → 00111...
  Gray: 0.1 → 00010..., 0.25 → 0$\bot$100...
Originally, another binary encoding of natural numbers
  Especially important in applications of conversion between analog and digital information [Knuth, 2005]
The Gray code embedding is an injection $\varphi_G$ that maps $x \in [0, 1]$ to an infinite sequence $p_0 p_1 p_2 \dots$, where
  $p_i := 1$ if $2^{-i} m - 2^{-(i+1)} < x < 2^{-i} m + 2^{-(i+1)}$ for an odd m,
  $p_i := 0$ if the same holds for an even m, and
  $p_i := \bot$ if $x = 2^{-i} m - 2^{-(i+1)}$ for some integer m
For a vector $x = (x_1, \dots, x_d)$, $\varphi_G(x) = p^1_1 \dots p^d_1 p^1_2 \dots p^d_2 \dots$
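The definition of the Gray code embedding translates directly into code. In the sketch below the name `gray_code` is hypothetical; the test for the symbol at position i uses the fact that the three cases of the definition depend only on t = 2^i x + 1/2: the symbol is ⊥ when t is an integer, and otherwise it is the parity of floor(t).

```python
from fractions import Fraction

BOTTOM = "⊥"


def gray_code(x: Fraction, n: int) -> str:
    """First n symbols p_0 ... p_{n-1} of the Gray code of x in [0, 1]."""
    symbols = []
    for i in range(n):
        t = x * 2**i + Fraction(1, 2)
        if t.denominator == 1:
            # x = 2^-i * m - 2^-(i+1) for the integer m = t
            symbols.append(BOTTOM)
        else:
            m = t.numerator // t.denominator  # m < t < m + 1
            symbols.append("1" if m % 2 else "0")
    return "".join(symbols)


for value in ("0.1", "0.25", "0.8"):
    print(value, gray_code(Fraction(value), 5))
```

Exact rational arithmetic is used so that boundary points such as 0.25 are detected reliably.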
Gray Code Embedding
[Figure: Gray code embedding of the interval [0, 1]; horizontal axis from 0 to 1 with midpoint 0.5, vertical axis "Position"]
  0.1 → 00010...
  0.25 → 0$\bot$100...
  0.8 → 10101...
COOL with Gray Code (G-COOL)
[Figure: points A to E on the interval [0, 1] with ticks at 0, 0.25, 0.5, 0.75, 1, and the overlapping Gray code intervals of Lv. 1 (0, 1, $\bot$1) and Lv. 2 (00, 01, 10, 11, 0$\bot$1, $\bot$10, 1$\bot$1)]
  id  value    Lv. 1       Lv. 2
  A   0.11...  0           00
  B   0.398    0, $\bot$1  01, $\bot$10
  C   0.526    1, $\bot$1  10, $\bot$10
  D   0.701    1, $\bot$1  10, 1$\bot$1
  E   0.796    1           11, 1$\bot$1
Lv. 1: MCL = 1 × 2 = 2
Lv. 2: MCL = 2 × 3 = 6
Theoretical Analysis of G-COOL
Use the Gray code as a fixed encoding in COOL
  It achieves internal cohesion and external isolation
Theorem: For the level-k partition $\mathcal{C}^k$, $x, y \in X$ are in the same cluster if $d_\infty(x, y) < 2^{-(k+1)}$
  Thus x, y are in different clusters only if $d_\infty(x, y) \ge 2^{-(k+1)}$
  $d_\infty(x, y) = \max_{i \in \{1, \dots, d\}} |x_i - y_i|$ ($L_\infty$ metric)
  Two adjacent intervals overlap and they are agglomerated
Corollary: In the optimal partition $\mathcal{C}_{op}$, for all $x \in C$ ($C \in \mathcal{C}_{op}$), its nearest neighbor $y \in C$
  y is a nearest neighbor of x $\iff$ $y \in \operatorname{argmin}_{y \in X} d_\infty(x, y)$
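The effect of overlapping intervals can be illustrated in one dimension. The sketch below rests on an assumption read off the figures rather than stated in the deck: the level-k Gray code intervals are taken to be the dyadic intervals of width 2^-k together with copies shifted by 2^-(k+1), so two points share an interval exactly when their half-cells of width 2^-(k+1) are equal or adjacent. Points that fall exactly on a boundary are not treated specially, the name `gcool_level_k_1d` is hypothetical, and the value 0.11 for point A is a stand-in.

```python
import math


def gcool_level_k_1d(values, k):
    """Level-k clusters of 1-D values in [0, 1] under the overlapping-interval view."""
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        # Half-cell indices; a difference of at most 1 means a shared interval.
        gap = math.floor(cur * 2 ** (k + 1)) - math.floor(prev * 2 ** (k + 1))
        if gap <= 1:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return clusters


values = [0.11, 0.398, 0.526, 0.701, 0.796]
print(gcool_level_k_1d(values, 1))
print(gcool_level_k_1d(values, 2))
```

Under this reading, two values closer than 2^-(k+1) always land in equal or adjacent half-cells, which is the one-dimensional case of the theorem.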
Demonstration of G-COOL
[Figure: two scatter plots on the unit square, both axes from 0 to 1; one panel shows G-COOL, the other shows COOL with the binary encoding]
Experimental Methods
Analyze G-COOL empirically with synthetic and real datasets compared to DBSCAN and K-means
  Synthetic datasets were generated by the R package clusterGeneration [Qiu and Joe, 2006]
    n = 1,500 for each cluster and d = 3
  Real datasets were geospatial images from Earth-as-Art
    reduced to 200 × 200 pixels, translated into binary images
All data were normalized by min-max normalization
G-COOL was implemented in R (version 2.12.1)
Internal and external measures were used
  Internal: MCL, connectivity, Silhouette width
  External: adjusted Rand index
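Min-max normalization maps every coordinate into [0, 1], which is the domain the encodings in this deck work on. The experiments were run in R; the Python sketch below only illustrates the rescaling step, and the name `min_max_normalize` and the handling of constant columns (mapped to 0.0) are assumptions.

```python
def min_max_normalize(data):
    """Rescale each column of a list of equal-length rows to the range [0, 1]."""
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in data
    ]


print(min_max_normalize([[1, 10], [3, 20], [5, 30]]))
```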
Results (Synthetic datasets)
[Figure: MCL, connectivity, Silhouette width, runtime (s), and adjusted Rand index against the number of clusters (2, 4, 6) for G-COOL, DBSCAN, and K-means. Lower is better for MCL and connectivity; higher is better for Silhouette width and adjusted Rand index. Data show mean ± s.e.m.; each experiment was performed 20 times.]
Results (Synthetic datasets)
[Figure: MCL, connectivity, Silhouette width, runtime (s), and adjusted Rand index of G-COOL against the noise parameter N (0 to 10). Lower is better for MCL and connectivity; higher is better for Silhouette width and adjusted Rand index. Data show mean ± s.e.m.; each experiment was performed 20 times.]
Results (Real datasets)
[Figure: original image and binary filtering panels, followed by G-COOL and K-means clustering results, for the datasets Delta, Dragon, Europe, Norway, and Ganges]
Results (Real datasets)
  Name    n       K   Running time (s)    MCL
                      GC       KM         GC      KM
  Delta   20748   4   1.18*    0.012      4010    4922
  Dragon  29826   2   0.9*     0.026      906*    7166
  Europe  1780*   6   2.404    0.041      220*    12210
  Norway  22771   *   0.746    0.026      1820    6114
  Ganges  18019   6   0.9*     0.026      220*    1226
GC: G-COOL, KM: K-means
* One or more digits of this value did not survive in the source
Conclusion
Integrate clustering and its evaluation in the coding-oriented manner
  An effective solution for two essential problems: how to measure goodness of results and how to find good clusters
  No distance calculation and no data distribution
Key ideas:
1. Fix of an encoding scheme for real-valued variables
   Introduced the MCL focusing on compression of clusters
   Formulated clustering with the MCL, and constructed COOL that finds the global optimal solution linearly
2. The Gray code
   We showed efficiency and effectiveness of G-COOL theoretically and experimentally
Appendix
Notation (1/2)
A datum $x \in \mathbb{R}^d$, a data set $X = \{x_1, \dots, x_n\}$
  #X is the number of elements in X
  $X \setminus Y$ is the relative complement of Y in X
Clustering is a partition of X into K subsets (clusters) $C_1, \dots, C_K$
  $C_i \ne \emptyset$ and $C_i \cap C_j = \emptyset$
  We call $\mathcal{C} = \{C_1, \dots, C_K\}$ a partition of X
  $\mathcal{C}(X) = \{\, \mathcal{C} \mid \mathcal{C} \text{ is a partition of } X \,\}$
The sets of finite and infinite sequences over an alphabet $\Sigma$ are denoted by $\Sigma^*$ and $\Sigma^\omega$, resp.
  The length |w| is the number of symbols other than $\bot$
    If $w = 11\bot100$, then |w| = 5
  For a set of sequences W, $|W| = \sum_{w \in W} |w|$
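Notation (2/2)
An embedding of $\mathbb{R}^d$ is an injective function $\varphi$ from $\mathbb{R}^d$ to $\Sigma^\omega$
For $p, q \in \Sigma^\omega$, define $p \le q$ if $p_i = q_i$ for all i with $p_i \ne \bot$
  Intuitively, q is more concrete than p
For $w \in \Sigma^*$, we write $w \sqsubset p$ if $w\bot^\omega \le p$
${\uparrow}w = \{\, p \in \mathrm{range}(\varphi) \mid w \sqsubset p \,\}$ for $w \in \Sigma^*$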
Optimization by COOL
The optimal partition $\mathcal{C}_{op}$ can be constructed by the level-k partitions
  For all $C \in \mathcal{C}_{op}$, there exists k such that $C \in \mathcal{C}^k$
The level-k partitions have the hierarchical structure
  For each $C \in \mathcal{C}^k$ we have $C = \bigcup \mathcal{D}$ for some $\mathcal{D} \subseteq \mathcal{C}^{k+1}$
  COOL is similar to divisive hierarchical clustering
COOL always outputs the global optimal partition $\mathcal{C}_{op}$
The time complexity is O(nd) (best) and O(nd + K!) (worst)
  Usually $K \ll n$ holds, hence O(nd)
Clustering Process of COOL
[Figure: step-by-step clustering of points on the unit square, both axes from 0 to 1. For K = 2 the level-1 partition is shown with clusters 1, 2, 3; for K = 5 the level-2 partition is shown with clusters 1 to 6, together with the cluster hierarchy across Lv. 1 and Lv. 2.]
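The Multi-Dimensional Gray Code
Use the wrapping function $\mathrm{wrap}(p^1, \dots, p^d) := p^1_1 \dots p^d_1 p^1_2 \dots p^d_2 \dots$
Define $\varphi^d_G : [0, 1]^d \to \Sigma^\omega_{\bot, d}$ by
  $\varphi^d_G(x_1, \dots, x_d) := \mathrm{wrap}(\varphi_G(x_1), \dots, \varphi_G(x_d))$
We abbreviate d of $\varphi^d_G$ if it is understood from the context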
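The wrapping function simply interleaves the per-coordinate codes symbol by symbol. A minimal sketch on finite prefixes of equal length, where the name `wrap` and the string representation of the codes are assumptions:

```python
from itertools import chain


def wrap(codes):
    """Interleave d equal-length symbol strings: p^1_1 ... p^d_1 p^1_2 ... p^d_2 ..."""
    return "".join(chain.from_iterable(zip(*codes)))


# Gray code prefixes of 0.1 and 0.25 as given earlier in the deck.
print(wrap(["00010", "0⊥100"]))
```

Interleaving means that a prefix of the wrapped sequence refines all coordinates at the same rate, so each level halves the cell width along every axis.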
Internal Measures
Connectivity [Handl et al., 2005]
  $\mathrm{Conn}(\mathcal{C}) = \sum_{x \in X} \sum_{i=1}^{M} f(x, \mathrm{nn}(x, i)) / i$
    nn(x, i) is the i-th neighbor of x, f(x, y) is 0 if x and y belong to the same cluster, and 1 otherwise
    M is an input parameter (we set as 10)
  Takes values from 0 to $\infty$, should be minimized
Silhouette width
  The average of Silhouette value S(x) for each x
    $S(x) = (b(x) - a(x)) / \max(b(x), a(x))$
    $a(x) = \frac{1}{\#C} \sum_{y \in C} d(x, y)$ ($x \in C$)
    $b(x) = \min_{D \in \mathcal{C} \setminus \{C\}} \frac{1}{\#D} \sum_{y \in D} d(x, y)$
  Takes values from −1 to 1, should be maximized
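External Measures
Adjusted Rand index
  Let the result be $\mathcal{C} = \{C_1, \dots, C_K\}$ and the correct partition be $\mathcal{D} = \{D_1, \dots, D_M\}$
  Suppose $n_{ij} := \#\{\, x \in X \mid x \in C_i, x \in D_j \,\}$. Then
  $$\frac{\sum_{i,j} \binom{n_{ij}}{2} - \left( \sum_i \binom{\#C_i}{2} \sum_j \binom{\#D_j}{2} \right) / \binom{n}{2}}{2^{-1} \left( \sum_i \binom{\#C_i}{2} + \sum_j \binom{\#D_j}{2} \right) - \left( \sum_i \binom{\#C_i}{2} \sum_j \binom{\#D_j}{2} \right) / \binom{n}{2}}$$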
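The adjusted Rand index formula maps onto three pair counts. A minimal sketch working on label lists, with the hypothetical name `adjusted_rand_index`; it assumes the denominator is non-zero, which fails for degenerate inputs such as both partitions consisting of a single cluster.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_pred, labels_true):
    """Adjusted Rand index of two labelings of the same n items."""
    n = len(labels_pred)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_pred, labels_true)).values())
    sum_i = sum(comb(c, 2) for c in Counter(labels_pred).values())
    sum_j = sum(comb(c, 2) for c in Counter(labels_true).values())
    expected = sum_i * sum_j / comb(n, 2)
    maximum = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index does not depend on how the clusters are named, only on which items are grouped together.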
Discussion
Results for synthetic datasets
  Best performance under the internal measures
  (Nearly) best performance under the external measures
    G-COOL is efficient and effective
  DBSCAN is sensitive to input parameters
  The MCL works well as an internal measure
Results for real datasets
  Not good, and not bad
    There are no clear clusters originally
G-COOL is a good clustering method
Related Work
Partitional methods [Chaoji et al., 2009]
Mass-based methods [Ting and Wells, 2010]
Density-based methods (DBSCAN [Ester et al., 1996])