Cluster Validation

Cluster Validity
• For supervised classification we have a variety of measures to evaluate how good our model is
– Accuracy, precision, recall
• For cluster analysis, the analogous question is how to evaluate the “goodness” of the resulting clusters?
• Cluster validation
– Assess the quality and reliability of clustering results.
• Why validation?
– To avoid finding clusters formed by chance
– To compare clustering algorithms
– To choose clustering parameters
• e.g., the number of clusters in the K-means algorithm
Clusters found in Random Data
[Figure: four scatter plots with x and y axes from 0 to 1; panel titles: Random Points, DBSCAN, K-means, Complete Link]
K K Aggarwal, Dept of OR, DU 1

Aspects of Cluster Validation
• Comparing the clustering results to ground truth (externally known results).
– External Index
• Evaluating the quality of clusters without reference to external information.
– Use only the data
– Internal Index
• Determining the reliability of clusters.
– To what confidence level the clusters are not formed by chance
– Statistical framework

Cluster validation process
• Cluster validation refers to procedures that evaluate the results of clustering in a quantitative and objective fashion
– How to be “quantitative”: To employ the measures.
– How to be “objective”: To validate the measures!
[Figure: INPUT DataSet(X) → Clustering Algorithm → Partitions P, Codebook C → Validity Index → m*; repeated for different numbers of clusters m]
Measuring clustering validity
Internal Index:
• Validate without external info
• Solve the number of clusters
External Index:
• Validate against ground truth
• Compare two clusters (how similar)

Internal indexes
• Minimizes (or maximizes) internal index:
– Rule of thumb: one simple rule of thumb sets the number to $k \approx \sqrt{n/2}$, with n as the number of objects (data points).
– Variances of within cluster and between clusters
– Rate-distortion method
– F-ratio
– Davies-Bouldin index (DBI)
– Bayesian Information Criterion (BIC)
– Silhouette Coefficient

Mean square error (MSE)
• The more clusters the smaller the MSE.
• Small knee-point near the correct value.
• But how to detect?
[Figure: MSE versus number of clusters (5 to 25) for dataset S2; knee-point between 14 and 15 clusters]
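One possible answer to "how to detect?" is to look for the point where the MSE curve bends most sharply. The sketch below takes the knee as the largest second difference of the curve. This heuristic, the function name `knee_point`, and the sample MSE values are assumptions for illustration only; they are not taken from the slides or read from the plot, and the cluster counts are assumed to be equally spaced.

```python
def knee_point(ks, mse):
    """Return the k at which the MSE curve bends most sharply.

    Assumption: the knee is the largest second difference
    mse[i-1] - 2*mse[i] + mse[i+1]; other definitions exist.
    """
    if len(ks) != len(mse) or len(ks) < 3:
        raise ValueError("need at least three (k, MSE) pairs")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = mse[i - 1] - 2 * mse[i] + mse[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical MSE values, chosen only to show a clear bend at k = 4.
ks = [2, 3, 4, 5, 6]
mse = [10.0, 6.0, 2.0, 1.8, 1.7]
print(knee_point(ks, mse))
```

On noisy curves a plain second difference can pick up spurious bends, which is one motivation for the normalized indexes that follow.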
From MSE to cluster validity
• Minimize within cluster variance (MSE)
• Maximize between cluster variance
[Figure: intra-cluster variance is minimized; inter-cluster variance is maximized]

Sum-of-squares based indexes
• $SSW / k$ ---- Ball and Hall (1965)
• $k^2 |W|$ ---- Marriot (1971)
• $\frac{SSB / (k - 1)}{SSW / (N - k)}$ ---- Calinski & Harabasz (1974)
• $\log(SSB / SSW)$ ---- Hartigan (1975)
• $d \log\left(\sqrt{SSW / (d N^2)}\right) + \log(k)$ ---- Xu (1997)
(d is the dimension of data; N is the size of data; k is the number of clusters)
SSW = Sum of squares within the clusters (=MSE)
SSB = Sum of squares between the clusters
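As a rough illustration, the indexes above can be evaluated directly once SSW and SSB are known. The sketch below is a minimal version under stated assumptions: the function name `sum_of_squares_indexes` is hypothetical, natural logarithms are used, the Xu index is written with a square root inside the logarithm (its commonly cited form), and Marriot's index is left out because it needs the within-cluster scatter matrix W rather than the scalar SSW.

```python
import math


def sum_of_squares_indexes(ssw, ssb, n, k, d):
    """Evaluate sum-of-squares based indexes from SSW and SSB.

    n = number of data points, k = number of clusters, d = dimension.
    Requires 1 < k < n and ssw > 0, ssb > 0.
    """
    return {
        "ball_hall": ssw / k,
        "calinski_harabasz": (ssb / (k - 1)) / (ssw / (n - k)),
        "hartigan": math.log(ssb / ssw),
        # Assumption: square root inside the log, natural log base.
        "xu": d * math.log(math.sqrt(ssw / (d * n ** 2))) + math.log(k),
    }


# Example values: SSW = 1 and SSB = 9 for four 1-D points in two clusters.
for name, value in sum_of_squares_indexes(ssw=1.0, ssb=9.0, n=4, k=2, d=1).items():
    print(name, round(value, 4))
```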

Internal Measures: Cohesion and Separation
• Cluster Cohesion: Measures how closely related are objects in a cluster
– Example: SSE
• Cluster Separation: Measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
– Cohesion is measured by the within cluster sum of squares (SSE)
$$SSW = \sum_i \sum_{x \in C_i} (x - m_i)^2$$
– Separation is measured by the between cluster sum of squares
$$SSB = \sum_i |C_i| (m - m_i)^2$$
– Where |Ci| is the size of cluster i, m_i is the mean of cluster i and m is the overall mean
• Total variance: $\sigma(X) = SSW + SSB$

Internal Measures: Cohesion and Separation
• Example: SSE
– BSS + WSS = constant
– Data points 1, 2, 4, 5 with overall mean m = 3; for K=2 the cluster means are m1 = 1.5 and m2 = 4.5
K=1 cluster:
$$WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10$$
$$BSS = 4 \times (3 - 3)^2 = 0$$
$$Total = 10 + 0 = 10$$
K=2 clusters:
$$WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1$$
$$BSS = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9$$
$$Total = 1 + 9 = 10$$
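The worked example with the points 1, 2, 4, 5 can be checked with a few lines of code. The sketch below handles one-dimensional data only, and the function name `wss_bss` is a hypothetical helper, not something defined in the slides.

```python
def wss_bss(clusters):
    """Return (WSS, BSS) for a list of 1-D clusters (lists of numbers)."""
    points = [x for c in clusters for x in c]
    m = sum(points) / len(points)  # overall mean
    wss = 0.0
    bss = 0.0
    for c in clusters:
        mi = sum(c) / len(c)  # mean of this cluster
        wss += sum((x - mi) ** 2 for x in c)  # cohesion term
        bss += len(c) * (m - mi) ** 2  # separation term
    return wss, bss


# K=1 and K=2 partitions of the points 1, 2, 4, 5.
for clusters in ([[1, 2, 4, 5]], [[1, 2], [4, 5]]):
    wss, bss = wss_bss(clusters)
    print(f"K={len(clusters)}: WSS={wss}, BSS={bss}, total={wss + bss}")
```

Both partitions give the same total of 10, which is the "BSS + WSS = constant" property.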
F-ratio variance test
• Variance-ratio F-test
• Measures ratio of between-groups variance against the within-groups variance (original F-test)
• F-ratio (WB-index):
$$F = \frac{k \cdot SSW}{SSB} = \frac{k \sum_{i=1}^{N} \| x_i - c_{p(i)} \|^2}{\sum_{j=1}^{k} n_j \| c_j - \bar{x} \|^2}, \qquad SSB = \sigma(X) - SSW$$

F-ratio for dataset S1
[Figure: F-ratio (x10^5) versus number of clusters (25 down to 5) for PNN and IS, with the minimum marked]
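The F-ratio formula can be sketched for vectors of any dimension as follows. This is a minimal illustration, assuming points are given as tuples and the partition as a list of integer labels; the function name `f_ratio` and the four sample points are hypothetical.

```python
def f_ratio(points, labels):
    """F = k * SSW / SSB for points (tuples) with integer cluster labels."""
    dim = len(points[0])
    n = len(points)
    # Overall mean of the data set (x-bar in the formula).
    mean = [sum(p[a] for p in points) / n for a in range(dim)]
    ssw = 0.0
    ssb = 0.0
    cluster_ids = sorted(set(labels))
    for j in cluster_ids:
        members = [p for p, l in zip(points, labels) if l == j]
        nj = len(members)
        # Centroid c_j of cluster j.
        cj = [sum(p[a] for p in members) / nj for a in range(dim)]
        ssw += sum(sum((p[a] - cj[a]) ** 2 for a in range(dim)) for p in members)
        ssb += nj * sum((cj[a] - mean[a]) ** 2 for a in range(dim))
    # Raises ZeroDivisionError when SSB is 0, e.g. for a single cluster.
    return len(cluster_ids) * ssw / ssb


# Hypothetical data: two tight pairs of points, far apart.
points = [(0, 0), (0, 2), (4, 0), (4, 2)]
labels = [0, 0, 1, 1]
print(f_ratio(points, labels))
```

Here SSW = 4 and SSB = 16, so F = 2 * 4 / 16; lower values indicate a better partition under this index.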

Davies-Bouldin index (DBI)
• Minimize intra cluster variance
• Maximize the distance between clusters
• Cost function combines the two as a ratio:
$$R_{j,k} = \frac{MAE_j + MAE_k}{d(c_j, c_k)}$$
$$DBI = \frac{1}{M} \sum_{j=1}^{M} \max_{k \neq j} R_{j,k}$$
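A minimal sketch of the DBI computation is given below. The slides do not define MAE_j, so the code assumes it is the mean Euclidean distance from the points of cluster j to its centroid c_j, and that d(c_j, c_k) is the Euclidean distance between centroids. The function name `davies_bouldin` and the sample data are hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """DBI = average over clusters j of max_{k != j} R_{j,k}."""
    ids = sorted(set(labels))
    if len(ids) < 2:
        raise ValueError("DBI needs at least two clusters")
    dim = len(points[0])
    centroids, mae = {}, {}
    for j in ids:
        members = [p for p, l in zip(points, labels) if l == j]
        c = tuple(sum(p[a] for p in members) / len(members) for a in range(dim))
        centroids[j] = c
        # Assumption: MAE_j = mean distance of the members to the centroid.
        mae[j] = sum(math.dist(p, c) for p in members) / len(members)
    total = 0.0
    for j in ids:
        # Worst-case (largest) ratio R_{j,k} against every other cluster k.
        total += max(
            (mae[j] + mae[k]) / math.dist(centroids[j], centroids[k])
            for k in ids
            if k != j
        )
    return total / len(ids)


points = [(0, 0), (0, 2), (4, 0), (4, 2)]
labels = [0, 0, 1, 1]
print(davies_bouldin(points, labels))
```

Each cluster here has MAE = 1 and the centroids are 4 apart, so R = (1 + 1) / 4. Smaller DBI values indicate more compact, better separated clusters.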
Silhouette coefficient
[Kaufman&Rousseeuw, 1990]
• We need a quantitative method to assess the quality of a clustering...
• The silhouette value of a point is a measure of how similar a point is to points in its own cluster compared to points in other clusters
• Formal definition:
– a(i) is the average distance of the point i to the other points in its own cluster A
– d(i, C) is the average distance of the point i to the points in the cluster C
– b(i) is the minimal d(i, C) over all clusters other than A

Silhouette coefficient
[Kaufman&Rousseeuw, 1990]
• Cohesion: measures how closely related are objects in a cluster
• Separation: measures how distinct or well-separated a cluster is from other clusters
[Figure: cohesion and separation]

Silhouette coefficient
• Cohesion a(x): average distance of x to all other vectors in the same cluster.
• Separation b(x): average distance of x to the vectors in each other cluster; take the minimum over those clusters.
• Silhouette s(x):
$$s(x) = \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}$$
• s(x) ∈ [-1, +1]: -1 = bad, 0 = indifferent, 1 = good
• Silhouette coefficient (SC):
$$SC = \frac{1}{N} \sum_{i=1}^{N} s(x_i)$$
[Figure: a(x) is the average distance within the cluster (cohesion); b(x) is the average distance to each other cluster, taking the minimum (separation)]
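The definitions of a(x), b(x), s(x) and SC translate fairly directly into code. The sketch below is a straightforward quadratic-time version using Euclidean distance. The function name `silhouette_coefficient` is hypothetical, and assigning s(x) = 0 to a point that is alone in its cluster is a convention assumed here, not something stated in the slides.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette s(x) over all points (tuples) with cluster labels."""
    labels = list(labels)
    ids = sorted(set(labels))
    if len(ids) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for i, x in enumerate(points):
        own = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # assumption: singleton cluster scores 0
            continue
        # Cohesion: average distance to the other members of x's cluster.
        a = sum(math.dist(x, p) for p in own) / len(own)
        # Separation: smallest average distance to any other cluster.
        b = min(
            sum(math.dist(x, p) for p, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in ids
            if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 2), (4, 0), (4, 2)]
labels = [0, 0, 1, 1]
print(round(silhouette_coefficient(points, labels), 4))
```

For this symmetric example every point has a(x) = 2 and b(x) = (4 + sqrt(20)) / 2, so all points share the same positive silhouette value.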
[Slides containing figures only; titles: Silhouette coefficient; Performance of Silhouette coefficient; Internal indexes; Soft partitions; Comparison of the indexes (K-means)]