Cluster Analysis
Partitioning Methods
Hierarchical Methods
Examples of Clustering Applications
- Pattern recognition
- Spatial data analysis
  - Create thematic maps in GIS by clustering feature spaces
  - Detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- WWW
  - Document classification
  - Clustering weblog data to discover groups of similar access patterns

Requirements of Clustering
- Scalability
- High dimensionality
Data Structures
- Data matrix (two modes): n objects described by p variables

  $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

- Dissimilarity matrix (one mode): pairwise distances d(i, j)

  $\begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ d(3,1) & d(3,2) & 0 & \\ \vdots & \vdots & \vdots & \\ d(n,1) & d(n,2) & \cdots & 0 \end{bmatrix}$
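As a sketch of how the one-mode dissimilarity matrix is derived from the two-mode data matrix, assuming Euclidean distance (the function name and sample data below are illustrative):

```python
import math

def dissimilarity_matrix(data):
    """Build the lower-triangular one-mode matrix d(i, j) from an n x p data matrix."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            # Euclidean distance between object i and object j
            d[i][j] = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
    return d

data = [[1.0, 1.0], [2.0, 1.0], [4.0, 3.0]]  # 3 objects, 2 variables
d = dissimilarity_matrix(data)
# d[1][0] is 1.0; d[2][0] is sqrt(13)
```

Only the lower triangle is stored, mirroring the one-mode layout above; the diagonal is zero by definition.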
Types of Variables
- Interval-scaled variables
- Binary variables
Interval-valued Variables
- Standardize the data:
  - Calculate the mean absolute deviation
    $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$,
    where $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$
  - Calculate the standardized measurement (z-score)
    $z_{if} = \frac{x_{if} - m_f}{s_f}$
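A minimal sketch of this standardization (the function name and sample column are illustrative):

```python
def standardize(column):
    """Z-score one variable f using the mean absolute deviation s_f."""
    n = len(column)
    m_f = sum(column) / n                        # mean m_f
    s_f = sum(abs(x - m_f) for x in column) / n  # mean absolute deviation s_f
    return [(x - m_f) / s_f for x in column]

z = standardize([2.0, 4.0, 6.0, 8.0])
# m_f = 5.0, s_f = 2.0, so z = [-1.5, -0.5, 0.5, 1.5]
```

Using the mean absolute deviation rather than the standard deviation dampens the influence of outliers on the scaling.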
Some popular distances include the Minkowski distance:
$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$
where q is a positive integer.

If q = 1, d is the Manhattan distance:
$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
If q = 2, d is the Euclidean distance:
$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
Properties:
- $d(i,j) \ge 0$
- $d(i,i) = 0$
- $d(i,j) = d(j,i)$
- $d(i,j) \le d(i,k) + d(k,j)$ (triangle inequality)
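A minimal sketch of the Minkowski family (the function name and sample points are illustrative):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q; q = 1 gives Manhattan, q = 2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = (1.0, 2.0), (4.0, 6.0)
manhattan = minkowski(i, j, 1)  # |1-4| + |2-6| = 7.0
euclidean = minkowski(i, j, 2)  # sqrt(9 + 16) = 5.0
```

Both special cases satisfy the four metric properties listed above.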
Binary Variables
- A contingency table for binary data:

                 Object j
                  1     0    sum
   Object i  1    a     b    a+b
             0    c     d    c+d
           sum   a+c   b+d    p

- Distance measure (if the binary variable is asymmetric), the Jaccard coefficient:
  $d(i,j) = \frac{b + c}{a + b + c}$

April 27, 2016
Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N
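The table's pairwise dissimilarities can be checked with the Jaccard coefficient above. This sketch assumes the usual treatment of this example: gender is a symmetric attribute and is left out, the remaining attributes are asymmetric binary, and Y/P map to 1 while N maps to 0 (these conventions are assumptions here):

```python
def jaccard_dissim(x, y):
    """d(i,j) = (b + c) / (a + b + c); 0-0 matches (d) are ignored."""
    a = sum(xi == 1 and yi == 1 for xi, yi in zip(x, y))  # 1-1 matches
    b = sum(xi == 1 and yi == 0 for xi, yi in zip(x, y))  # i has it, j does not
    c = sum(xi == 0 and yi == 1 for xi, yi in zip(x, y))  # j has it, i does not
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1..Test-4 with Y/P -> 1, N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

d_jack_mary = jaccard_dissim(jack, mary)  # 1/3
d_jack_jim  = jaccard_dissim(jack, jim)   # 2/3
d_jim_mary  = jaccard_dissim(jim, mary)   # 3/4
```

Under these conventions Jack and Mary are the most similar pair, and Jim and Mary the least similar.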
Nominal Variables
Ordinal Variables
- Replace each value $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$ and map the rank onto [0, 1] by
  $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
Ratio-Scaled Variables
Methods: apply a logarithmic transformation, $y_{if} = \log(x_{if})$, or treat the values as continuous ordinal data and use their ranks.
Variables of Mixed Types
- A database may contain several types of variables; a weighted formula combines their effects:
  $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
- If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$
- If f is interval-based: use the normalized distance
- If f is ordinal or ratio-scaled: compute the ranks $r_{if}$, set $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
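A sketch of the weighted mixed-type formula. The function, type labels, and sample objects are illustrative; interval values are assumed already normalized to [0, 1], ordinal values are given as ranks, and a missing value gets indicator delta = 0:

```python
def mixed_dissim(x, y, types, max_rank=None):
    """d(i,j) = sum(delta * d_f) / sum(delta) over the variables f."""
    num = den = 0.0
    for f, t in enumerate(types):
        if x[f] is None or y[f] is None:  # delta = 0: value missing
            continue
        if t == 'nominal':                # 0 if equal, 1 otherwise
            d_f = 0.0 if x[f] == y[f] else 1.0
        elif t == 'ordinal':              # map rank r to z = (r - 1) / (M_f - 1)
            zx = (x[f] - 1) / (max_rank[f] - 1)
            zy = (y[f] - 1) / (max_rank[f] - 1)
            d_f = abs(zx - zy)
        else:                             # interval: normalized distance
            d_f = abs(x[f] - y[f])
        num += d_f
        den += 1.0
    return num / den

x = ['red', 0.2, 1]
y = ['blue', 0.6, 3]
d = mixed_dissim(x, y, ['nominal', 'interval', 'ordinal'], max_rank={2: 3})
# (1.0 + 0.4 + 1.0) / 3, approximately 0.8
```

Each variable contributes a value in [0, 1], so the combined dissimilarity also lies in [0, 1].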
Cluster Analysis
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Outlier Analysis
Summary
K-MEANS CLUSTERING

Step 1:
Initialization: we randomly choose the following two centroids (k = 2) for the two clusters. In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}, and recompute their centroids.
Step 3:
Using these centroids, we compute the Euclidean distance of each object to each centroid, as shown in the table.
Step 4:
The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}. Therefore, there is no change in the clusters. Thus, the algorithm halts here, and the final result consists of the two clusters {1, 2} and {3, 4, 5, 6, 7}.
[Plot: the clustering after Step 1 and Step 2 with K = 3.]

[Plot: medicines A, B, C, and D by attribute 1 (X, weight index) and attribute 2 (Y, pH).]
Step 1:
Initial value of the centroids: suppose we use medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1, 1) and c2 = (2, 1).
Step 2:
Objects clustering: we assign each object based on the minimum distance. Medicine A is assigned to group 1; medicines B, C, and D are assigned to group 2. Each element of the Group matrix below is 1 if and only if the object is assigned to that group.
Iteration 1, objects clustering: based on the new distance matrix, we move medicine B to group 1 while all the other objects remain. The Group matrix is shown below.
Iteration 2, determine centroids: we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Groups 1 and 2 both have two members, so the new centroids are c1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1) and c2, the mean of medicines C and D.
Iteration 2, objects-centroids distances: repeating step 2, we obtain the new distance matrix at iteration 2.
  Medicine     Feature 1 (X): weight index   Feature 2 (Y): pH   Group (result)
  Medicine A   1                             1                   1
  Medicine B   2                             1                   1
  Medicine C                                                     2
  Medicine D                                                     2
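The walk-through above can be reproduced with a minimal k-means sketch. A = (1, 1) and B = (2, 1) follow from the text; the coordinates C = (4, 3) and D = (5, 4) are assumed here for illustration:

```python
def kmeans(points, centroids):
    """Plain k-means: assign to the nearest centroid, update the means,
    and repeat until the centroids stop changing (assumes no group empties)."""
    while True:
        groups = [[] for _ in centroids]
        for p in points:
            # squared Euclidean distance to each centroid
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            groups[dists.index(min(dists))].append(p)
        new = [tuple(sum(x) / len(g) for x in zip(*g)) for g in groups]
        if new == centroids:
            return groups, centroids
        centroids = new

# A, B, C, D; start with A and B as the first centroids, as in the text
points = [(1, 1), (2, 1), (4, 3), (5, 4)]
groups, centroids = kmeans(points, [(1.0, 1.0), (2.0, 1.0)])
# groups: [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
# centroids: [(1.5, 1.0), (4.5, 3.5)]
```

With these assumed coordinates, B migrates from group 2 to group 1 in the second pass, exactly as in the text, and the run converges to {A, B} and {C, D}.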
Example

[Figure: K-means with K = 2 on ten points. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; repeat updating and reassigning until no change.]
Weaknesses of K-Means
- Applicable only when the mean is defined; categorical data needs a variant such as k-modes
- The number of clusters k must be specified in advance
- Sensitive to noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes
Dissimilarity Calculations

PAM works effectively for small data sets, but does not scale well to large data sets.
[Figure: PAM (K = 2) on ten points, showing configurations with Total Cost = 20 and Total Cost = 26. Arbitrarily choose k objects as the initial medoids; assign each remaining object to the nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping; swap O and O_random if the quality is improved; repeat the do-loop until no change.]

Data Mining: Concepts and Techniques
Four cases determine the cost C_jih of replacing a current medoid i by a non-medoid h, from the viewpoint of each object j (t denotes another current medoid):

1. j currently belongs to i and is reassigned to t after the swap: C_jih = d(j, t) - d(j, i)
2. j currently belongs to i and is reassigned to h: C_jih = d(j, h) - d(j, i)
3. j currently belongs to t and stays with t: C_jih = 0
4. j currently belongs to t and is reassigned to h: C_jih = d(j, h) - d(j, t)
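All four cases amount to one rule: each object contributes the change in its distance to its nearest medoid. A sketch of the total swap cost under that formulation (the function names and sample points are illustrative):

```python
def swap_cost(points, medoids, i, h, dist):
    """Total cost change TC_ih of replacing medoid i by non-medoid h.

    For each object j, the contribution C_jih is its distance to the nearest
    medoid after the swap minus the distance before; this covers all four cases.
    """
    before = medoids
    after = [m for m in medoids if m != i] + [h]
    total = 0.0
    for j in points:
        d_before = min(dist(j, m) for m in before)
        d_after = min(dist(j, m) for m in after)
        total += d_after - d_before
    return total  # the swap is accepted only when the total is negative

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

points = [(0, 0), (1, 0), (5, 5), (6, 5)]
# cost of replacing medoid (0, 0) by non-medoid (5, 5), keeping medoid (6, 5)
cost = swap_cost(points, [(0, 0), (6, 5)], (0, 0), (5, 5), manhattan)
# positive (17.0): this particular swap worsens the clustering and is rejected
```

Summing C_jih over all objects j gives the total cost TC_ih that PAM uses to accept or reject a candidate swap.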
Weakness: each iteration of PAM costs O(k(n - k)^2), so it is impractical for large n and k.
Cluster Analysis
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Outlier Analysis
Summary
Hierarchical Clustering

[Diagram: Step 0 through Step 4 of agglomerative clustering (AGNES): a and b merge into ab; d and e merge into de; c joins de to form cde; ab and cde merge into abcde. Divisive clustering (DIANA) performs the same steps in reverse order, Step 4 down to Step 0.]
AGNES merges the nodes that have the least dissimilarity, goes on in a non-descending fashion, and eventually all nodes belong to the same cluster.
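A minimal single-link AGNES sketch under the merge rule above (merge the least-dissimilar pair of clusters until one remains; names and sample points are illustrative):

```python
def agnes_single_link(points, dist):
    """Agglomerative nesting: repeatedly merge the two least-dissimilar clusters."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # single link: cluster dissimilarity = minimum pairwise distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
merges = agnes_single_link([(0, 0), (0, 1), (5, 5), (5, 6)], euclid)
# the first merge joins (0, 0) with (0, 1) at distance 1.0
```

The recorded merge distances are non-descending, which is exactly the property the slide describes.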
BIRCH (1996)
- Clustering feature: $CF = (N, LS, SS)$, where N is the number of points, $LS = \sum_{i=1}^{N} X_i$ is the linear sum, and $SS = \sum_{i=1}^{N} X_i^2$ is the square sum.
- Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
  $CF = (5, (16, 30), (54, 190))$
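The CF triple for the five points can be checked directly; this sketch computes the sums per dimension (the function name is illustrative):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    n = len(points)
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)       # linear sum
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)  # square sum
    return n, ls, ss

cf = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
# (5, (16, 30), (54, 190))
```

Because CF entries are sums, two subclusters can be merged by simply adding their CF triples componentwise, which is what makes the CF tree incremental.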
CF-Tree in BIRCH
Clustering feature: a summary of the statistics for a given subcluster, namely the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view.

[Diagram: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1, CF2, CF3, ..., CF6, each pointing to a child; a non-leaf node holds entries CF1, ..., CF5 with their children; leaf nodes hold CF entries and are chained by prev/next pointers.]
Similarity between two transactions (Jaccard coefficient):
$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

Example: for $T_1 = \{1, 2, 3\}$ and $T_2 = \{3, 4, 5\}$,
$Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1, 2, 3, 4, 5\}|} = \frac{1}{5} = 0.2$
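The example works out directly with Python sets (a one-line sketch):

```python
def sim(t1, t2):
    """Jaccard similarity between two transactions, viewed as sets of items."""
    return len(t1 & t2) / len(t1 | t2)

s = sim({1, 2, 3}, {3, 4, 5})  # |{3}| / |{1, 2, 3, 4, 5}| = 1/5 = 0.2
```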
ROCK: Algorithm
- Draw a random sample
- Cluster with links
- Label the data on disk
CHAMELEON (Hierarchical Clustering Using Dynamic Modeling)

A two-phase algorithm:
1. Use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters.
2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.

Overall framework of CHAMELEON:
Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partitions -> Final Clusters
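The sparse graph in phase 1 is typically a k-nearest-neighbor graph; a minimal sketch of its construction (the function names and sample points are illustrative, and the partitioning and merge phases are not shown):

```python
def knn_graph(points, k, dist):
    """Connect each point to its k nearest neighbors; edges as an adjacency dict."""
    edges = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # all other points, ordered by distance from point i
        neighbors = sorted((j for j in range(len(points)) if j != i),
                           key=lambda j: dist(p, points[j]))
        for j in neighbors[:k]:
            edges[i].add(j)
    return edges

euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
g = knn_graph([(0, 0), (0, 1), (1, 0), (9, 9)], 2, euclid)
# even the far-away point (9, 9) links only to its 2 nearest neighbors
```

Because each vertex keeps only k edges, dense regions stay internally well connected while distant regions are only weakly linked, which is what makes the subsequent graph partitioning effective.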