Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
- Outlier detection
- Summarization
- Compression
Scalability
- Clustering all the data instead of only samples
Ability to deal with different types of attributes
- Numerical, binary, categorical, ordinal, linked, and mixtures of these
Constraint-based clustering
- User may give inputs on constraints
- Use domain knowledge to determine input parameters
Interpretability and usability
Others
- Discovery of clusters with arbitrary shape
- Ability to deal with noisy data
- Incremental clustering and insensitivity to input order
- High dimensionality
Partitioning criteria
- Single level vs. hierarchical partitioning (often, multilevel hierarchical partitioning is desirable)
Separation of clusters
- Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Clustering space
Partitioning approach:
- Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
- Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
- Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
- Based on connectivity and density functions
- Typical methods: DBSCAN, OPTICS, DenClue
Grid-based approach:
- Based on a multiple-level granularity structure
- Typical methods: STING, WaveCluster, CLIQUE
Model-based:
- A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the given model
- Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
- Based on the analysis of frequent patterns
- Typical methods: p-Cluster
User-guided or constraint-based:
- Clustering by considering user-specified or application-specific constraints
- Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
- Objects are often linked together in various ways
- Massive links can be used to cluster objects: SimRank, LinkClus
The sum of squared errors for k clusters with centroids c_i:

E = Σ_{i=1}^{k} Σ_{p ∈ C_i} (p − c_i)²
[Figure: the k-means iteration — assign objects to the nearest centroid, update the cluster centroids, reassign objects, and loop until no change]
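The loop in the figure can be sketched in a few lines. This is a minimal illustration (not the book's code), with the initial centroids passed in explicitly rather than chosen at random:

```python
def kmeans(points, centroids, iters=100):
    """Minimal k-means sketch: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat until assignments stop
    changing. `centroids` holds the initial centroids (k = len(centroids))."""
    k = len(centroids)
    centroids = [tuple(c) for c in centroids]
    assign = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance
        new_assign = [
            min(range(k),
                key=lambda i, pt=pt: sum((a - b) ** 2
                                         for a, b in zip(pt, centroids[i])))
            for pt in points
        ]
        if new_assign == assign:  # "until no change"
            break
        assign = new_assign
        # Update step: each centroid moves to the mean of its members
        for i in range(k):
            members = [pt for pt, lbl in zip(points, assign) if lbl == i]
            if members:
                centroids[i] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return centroids, assign
```

On two well-separated blobs this converges in two passes.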
Weakness
- Sensitive to noisy data and outliers: an object with an extremely large value may substantially distort the cluster mean
Dissimilarity calculations
[Figure: PAM example with K = 2, total cost = 26]
- Arbitrarily choose k objects as initial medoids
- Assign each remaining object to the nearest medoid
- Do loop, until no change:
  - Randomly select a non-medoid object, O_random
  - Compute the total cost of swapping a medoid O with O_random
  - Swap O and O_random if the quality is improved
PAM Algorithm
1. Select k representative objects arbitrarily as the initial medoids
2. Assign each remaining object to the nearest medoid
3. Randomly select a non-medoid object O_random, compute the total cost of swapping a medoid with O_random, and perform the swap if it improves the quality
4. Repeat steps 2–3 until there is no change
PAM works effectively for small data sets, but does not scale well to large data sets
- Computational complexity is O(k(n − k)²) per iteration, where n is the number of data points and k is the number of clusters
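A sketch of the swap loop, assuming Manhattan distance and arbitrary initial medoids; `total_cost` and `pam` are illustrative names, not from the source:

```python
import itertools

def total_cost(points, medoids):
    """Total cost: sum over all points of the (Manhattan) distance
    to the nearest medoid."""
    return sum(min(abs(p[0] - m[0]) + abs(p[1] - m[1]) for m in medoids)
               for p in points)

def pam(points, k):
    """PAM sketch: start from arbitrary medoids, then repeatedly try
    swapping a medoid with a non-medoid object and keep the swap whenever
    it lowers the total cost, until no swap improves the quality."""
    medoids = list(points[:k])          # arbitrary initial medoids
    improved = True
    while improved:                     # "until no change"
        improved = False
        for m, o in itertools.product(list(medoids), points):
            if o in medoids:
                continue
            candidate = [o if x == m else x for x in medoids]
            if total_cost(points, candidate) < total_cost(points, medoids):
                medoids = candidate     # quality improved: perform the swap
                improved = True
    return medoids
```

Each accepted swap strictly lowers the cost, so the loop terminates at a local optimum.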
Hierarchical Clustering

[Figure: objects a–e; agglomerative (AGNES) proceeds bottom-up in steps 1–4, merging a, b into ab and d, e into de, then cde, then abcde; divisive (DIANA) proceeds top-down in the reverse order]
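The agglomerative (AGNES) direction can be sketched with single-link merging on 1-D data; this naive O(n³) version is for illustration only, not an efficient implementation:

```python
def agnes(points, k, dist=lambda p, q: abs(p - q)):
    """AGNES-style agglomerative sketch: start with one cluster per object
    and repeatedly merge the two closest clusters (single-link distance)
    until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
        del clusters[j]
    return clusters
```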
Go on in a non-descending fashion

[Figures: AGNES and DIANA clustering examples]
Centroid: Cm = (Σ_{i=1}^{N} t_ip) / N

Clustering Feature: CF = (N, LS, SS)
- N: number of data points
- LS: linear sum of the N points, Σ_{i=1}^{N} Xᵢ
- SS: square sum of the N points, Σ_{i=1}^{N} Xᵢ²

Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
Weakness: handles only numeric data, and is sensitive to the order of the data records

Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) have CF = (5, (16,30), (54,190))
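The CF example above is easy to check in code; this small helper (an assumed name, 2-D points only) computes (N, LS, SS) per dimension:

```python
def clustering_feature(points):
    """Compute a BIRCH clustering feature CF = (N, LS, SS) for 2-D points:
    N = count, LS = per-dimension linear sum, SS = per-dimension square sum."""
    n = len(points)
    ls = tuple(sum(p[d] for p in points) for d in range(2))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
    return n, ls, ss
```

CFs are additive, which is what lets BIRCH merge subclusters without rescanning the data.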
CF-Tree in BIRCH

[Figure: a CF-tree with L = 6. The root holds entries CF1 … CF6, each with a child pointer child1 … child6; an internal (non-leaf) node holds CF1 … CF5 with child pointers; leaf nodes hold CF entries (CF1, CF2, …, CF6 and CF1, CF2, …, CF4) and are chained via prev/next pointers]
Cluster Diameter

D = ( Σᵢ Σⱼ (xᵢ − xⱼ)² / (n(n − 1)) )^(1/2)
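A sketch of the diameter formula on 1-D points; the sum runs over all ordered pairs of distinct members:

```python
import math

def cluster_diameter(points):
    """Cluster diameter: the square root of the average squared distance
    over all ordered pairs, sqrt(sum_ij (xi - xj)^2 / (n(n-1)))."""
    n = len(points)
    total = sum((xi - xj) ** 2 for xi in points for xj in points)
    return math.sqrt(total / (n * (n - 1)))
```

Note that identical pairs contribute zero, so including them in the double sum is harmless.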
[Figure: Data Set → K-NN Graph → Sparse Graph → Merge Partition → Final Clusters]
- K-NN graph: p and q are connected if q is among the top k closest neighbors of p
- Relative interconnectivity: connectivity of c1 and c2 over internal connectivity
- Relative closeness: closeness of c1 and c2 over internal closeness
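The K-NN graph construction can be sketched on 1-D data; `knn_graph` is an illustrative helper (assuming distinct point values), not CHAMELEON itself:

```python
def knn_graph(points, k):
    """Build the directed k-nearest-neighbor graph used by CHAMELEON-style
    methods: p -> q iff q is among the k closest neighbors of p.
    Points are assumed distinct 1-D values."""
    edges = {}
    for p in points:
        # Sort the other points by distance to p and keep the k closest
        others = sorted((q for q in points if q != p), key=lambda q: abs(q - p))
        edges[p] = set(others[:k])
    return edges
```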
Generative Model
- For a set of objects partitioned into m clusters C1, …, Cm, the quality can be measured by Q = Π_{i=1}^{m} P(Cᵢ), where P(·) is the maximum likelihood
- Distance between clusters C1 and C2: dist(C1, C2) = −log ( P(C1 ∪ C2) / (P(C1) P(C2)) )

Algorithm: progressively merge points and clusters
Input: D = {o1, …, on}: a data set containing n objects
Output: a hierarchy of clusters
Method:
  Create a cluster for each object: Ci = {oi}, 1 ≤ i ≤ n
  For i = 1 to n {
    Find the pair of clusters Ci and Cj such that
      (Ci, Cj) = argmax_{i ≠ j} log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) );
    If log ( P(Ci ∪ Cj) / (P(Ci) P(Cj)) ) > 0 then merge Ci and Cj
  }
Two parameters:
- Eps: maximum radius of the neighbourhood
- MinPts: minimum number of points in an Eps-neighbourhood of that point
N_Eps(q) = {p | dist(p, q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable from q w.r.t. Eps, MinPts if p belongs to N_Eps(q) and q is a core point, i.e., |N_Eps(q)| ≥ MinPts

[Figure: p in the Eps-neighbourhood of q; MinPts = 5, Eps = 1 cm]
Density-reachable:
- A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q and pn = p such that each p_{i+1} is directly density-reachable from p_i
Density-connected:
- A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

[Figures: density-reachability via intermediate point p1; density-connection through o; Eps = 1 cm, MinPts = 5]
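The definitions above are enough to sketch DBSCAN; this 1-D toy version (not an optimized implementation) grows a cluster from each core point and labels noise as −1:

```python
def dbscan(points, eps, min_pts):
    """DBSCAN sketch on 1-D points: grow a cluster from each unvisited core
    point by absorbing the Eps-neighbourhoods of core members.
    Returns one label per point; -1 marks noise."""
    def region(i):
        # Indices of all points within eps of points[i] (includes i itself)
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1                  # not a core point: noise (for now)
            continue
        cluster += 1                        # start a new cluster at core point i
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster         # former noise becomes a border point
            if labels[j] is not None:
                continue                    # already claimed by a cluster
            labels[j] = cluster
            neighbours = region(j)
            if len(neighbours) >= min_pts:  # j is a core point: expand from it
                seeds.extend(neighbours)
    return labels
```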
Index-based:
- k = number of dimensions, N = 20, p = 75%, M = N(1 − p) = 5
- Complexity: O(N log N)
Core Distance:
- the minimum Eps such that the point is a core point
Reachability Distance:
- reachability-distance(p, o) = max(core-distance(o), dist(o, p))

[Figure: objects p1, p2 and core object o, with MinPts = 5, Eps = 3 cm]

[Figure: reachability plot — reachability-distance (undefined at first) plotted against the cluster order of the objects]
Major features:
- Uses grid cells but only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure
- Influence function: describes the impact of a data point within its neighborhood — the influence of y on x is f_Gaussian(x, y) = e^(−d(x,y)² / (2σ²))
- Overall density of the data space is the sum of the influence functions of all data points: f^D_Gaussian(x) = Σ_{i=1}^{N} e^(−d(x,xᵢ)² / (2σ²))
- Gradient of x in the direction of xᵢ: ∇f^D_Gaussian(x, xᵢ) = Σ_{i=1}^{N} (xᵢ − x) e^(−d(x,xᵢ)² / (2σ²))
- Clusters can be determined mathematically by identifying density attractors
- Density attractors are local maxima of the overall density function
- Center-defined clusters: assign to each density attractor the points density-attracted to it
- Arbitrary-shaped clusters: merge density attractors that are connected through paths of high density (> threshold)
[Figure: density attractors — local maxima of the overall density function]
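The overall density function can be sketched directly from the formula; a 1-D toy with `sigma` as an assumed smoothing parameter:

```python
import math

def gaussian_density(x, data, sigma):
    """Overall DENCLUE density at x: the sum of Gaussian influence
    functions exp(-d(x, xi)^2 / (2 sigma^2)) over all data points xi."""
    return sum(math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi in data)
```

Density attractors would then be found by hill-climbing this function along its gradient.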
[Figure: STING hierarchical grid structure — the (i−1)st layer and the i-th layer]
- Partition the data space and find the number of points that lie inside each cell of the partition
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters
[Figure: CLIQUE example with density threshold τ = 3 — dense 1-D intervals on the salary ($10,000), vacation (weeks), and age axes are intersected to find dense 2-D units in (age, salary) and (age, vacation)]
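The steps above can be sketched for 2-D data with equal-width cells; `width` and `tau` are assumed parameters, and only 1-D and 2-D units are handled in this illustration:

```python
from collections import Counter

def dense_units(points, width, tau):
    """CLIQUE-style sketch for 2-D data: partition each dimension into
    intervals of `width`, keep 1-D units with >= tau points, then check
    only candidate 2-D cells whose two 1-D projections are both dense
    (the Apriori principle)."""
    # Step 1: counts of 1-D units per dimension
    counts_1d = [Counter(int(p[d] // width) for p in points) for d in range(2)]
    dense_1d = [{u for u, c in cnt.items() if c >= tau} for cnt in counts_1d]
    # Step 2: Apriori pruning — a 2-D cell can only be dense if both
    # of its 1-D projections are dense
    counts_2d = Counter((int(p[0] // width), int(p[1] // width)) for p in points)
    dense_2d = {cell for cell, c in counts_2d.items()
                if c >= tau and cell[0] in dense_1d[0] and cell[1] in dense_1d[1]}
    return dense_1d, dense_2d
```

Connected dense units would then be merged into clusters.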
Strength
- Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
- Insensitive to the order of records in input and does not presume some canonical data distribution
- Scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
Weakness
- The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Subspace-clustering methods: CLIQUE, ProClus
p-Clustering: Clustering by Pattern Similarity

Why p-Clustering?

Mean values: d_{Ij} = (1/|I|) Σ_{i∈I} d_{ij}, and d_{IJ} = (1/(|I| |J|)) Σ_{i∈I, j∈J} d_{ij}

Properties of δ-pCluster
- Downward closure
- For scaling patterns, compare the ratios d_{xa} / d_{ya} and d_{xb} / d_{yb}
Empirical method
- # of clusters ≈ √(n/2) for a dataset of n points
Elbow method
- Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters
Cross-validation method
- Divide a given data set into m parts
- Use m − 1 parts to obtain a clustering model
- Use the remaining part to test the quality of the clustering
  - E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
- For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k's, and find the # of clusters that best fits the data
Summary
Weakness:
- Traditional measures for categorical data may not work well, e.g., the Jaccard coefficient
Example: two groups (clusters) of transactions
- C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
- C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
The Jaccard coefficient may lead to a wrong clustering result
- Within C1: ranges from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
- Across C1 & C2: could be as high as 0.5 ({a, b, c}, {a, b, f})
Jaccard coefficient-based similarity function:

Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|

Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:
Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
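The Jaccard coefficient and the similarity values quoted above are easy to verify in code:

```python
def jaccard(t1, t2):
    """Jaccard coefficient: |T1 ∩ T2| / |T1 ∪ T2| for two sets."""
    return len(t1 & t2) / len(t1 | t2)
```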
Clusters
- C1: <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
- C2: <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
Neighbors
- Two transactions are neighbors if sim(T1, T2) > threshold
- Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
  - T1 connected to: {a,b,d}, {a,b,e}, {a,c,d}, {a,c,e}, {b,c,d}, {b,c,e}, {a,b,f}, {a,b,g}
  - T2 connected to: {a,c,d}, {a,c,e}, {a,d,e}, {b,c,e}, {b,d,e}, {b,c,d}
  - T3 connected to: {a,b,c}, {a,b,d}, {a,b,e}, {a,b,g}, {a,f,g}, {b,f,g}
Link Similarity
- Link similarity between two transactions is the # of common neighbors
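Link similarity can be sketched from the definitions above; the threshold 0.4 is an assumption chosen so that `sim > threshold` reproduces the neighbor lists shown (all listed neighbor pairs have Jaccard similarity 0.5):

```python
def jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2)

def neighbors(t, transactions, threshold):
    """Transactions whose similarity with t exceeds the threshold."""
    return {u for u in transactions if u != t and jaccard(u, t) > threshold}

def link(t1, t2, transactions, threshold):
    """ROCK-style link similarity: the number of common neighbors."""
    return len(neighbors(t1, transactions, threshold)
               & neighbors(t2, transactions, threshold))
```

On the example clusters, {a, b, c} and {c, d, e} have a Jaccard similarity of only 0.2 but share four common neighbors, which is the signal ROCK exploits.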
Rock Algorithm
Method
- Compute the similarity matrix, using link similarity
- Run agglomerative hierarchical clustering
Problems:
- Guarantees cluster interconnectivity: any two transactions in a cluster should be well connected
[Figure: SimTrees ST1 and ST2 — leaf nodes n10, n11, n12 link to node n4 with similarities 0.9, 1.0, 0.8, and leaf nodes n13, n14 link to node n5 with similarities 0.9, 1.0; s(n4, n5) = 0.2; the averaged similarities are stored as a: (0.9, 3) and b: (0.95, 2)]

For each node nk ∈ {n10, n11, n12} and nl ∈ {n13, n14}, their path-based similarity is simp(nk, nl) = s(nk, n4) · s(n4, n5) · s(n5, nl).

sim(na, nb) = ( Σ_{k=10}^{12} s(nk, n4) / 3 ) · s(n4, n5) · ( Σ_{l=13}^{14} s(nl, n5) / 2 ) = 0.9 × 0.2 × 0.95 = 0.171
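The calculation above can be reproduced directly; `avg_path_similarity` is an illustrative helper name:

```python
def avg_path_similarity(links_a, s_mid, links_b):
    """LinkClus-style aggregation: average the leaf-to-node similarities on
    each side, then multiply through the connecting edge similarity."""
    avg_a = sum(links_a) / len(links_a)
    avg_b = sum(links_b) / len(links_b)
    return avg_a * s_mid * avg_b
```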
To compute sim(na, nb):
- Find all pairs of sibling nodes ni and nj such that na is linked with ni and nb with nj
- Calculate the similarity (and weight) between na and nb w.r.t. ni and nj
- Calculate the weighted average similarity between na and nb w.r.t. all such pairs
[Figure: a bipartite graph linking authors (Tom, Mike, Cathy, John, Mary) to conference proceedings (sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05, aaai04, aaai05), grouped under the conferences sigmod, vldb, aaai]

SimRank similarity:

sim(a, b) = ( C / (|I(a)| |I(b)|) ) · Σ_{i=1}^{|I(a)|} Σ_{j=1}^{|I(b)|} sim(Iᵢ(a), Iⱼ(b))

[Figure: a hierarchical structure of products in Walmart — All → grocery, electronics (TV, DVD, camera), apparel; similarly, a hierarchy of Articles → Words]
[Figure: similarity values among products such as Sony V3 digital camera, Digital Cameras, Consumer electronics, TVs, and Apparels]
[Figure: adjustment ratio for node n7 — a SimTree fragment with nodes n2–n9 and edge similarities 0.8, 0.9, 0.3, 0.2, 1.0]
Initialization of SimTrees
- Initializing a SimTree: repeatedly find groups of tightly related nodes, which are merged into a higher-level node
- Tightness of a group of nodes: for a group of nodes {n1, …, nk}, its tightness is defined as the number of leaf nodes in other SimTrees that are connected to all of {n1, …, nk}

[Figure: nodes n1, n2 connected to leaf nodes 1–5 in another SimTree]
Reduced to frequent pattern mining: the tightness of a group of nodes is the support of a frequent pattern

Transactions (one per leaf node, listing the nodes it is connected to):
{n1}
{n1, n2}
{n2}
{n1, n2}
{n1, n2}
{n2, n3, n4}
{n4}
{n3, n4}
{n3, n4}

[Figure: groups g1 = {n1, n2} and g2 = {n3, n4}]
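Tightness-as-support is easy to check against the transactions above:

```python
def support(group, transactions):
    """Tightness of a node group = support of the corresponding pattern:
    the number of transactions containing every node in the group."""
    return sum(1 for t in transactions if group <= t)
```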
[Figure: the refined SimTree with nodes n2–n9 and updated similarities 0.8 (at n7) and 0.9]
Complexity
For two types of objects, N in each, and M linkages between them:

            Time (updating similarities)   Space
LinkClus    O(M (log N)²)                  O(M + N)
SimRank     O(M²)                          O(N²)
Approaches compared:
- A sampling-based approach pre-computes a large sample of random paths from each object and uses the samples of two objects to estimate their SimRank similarity
Quantization & Transformation