FII 2010
The problem
Split a set of objects into natural groups based on
similarities/dissimilarities
Given
a set of a attributes
n objects, each expressed by values of the a attributes
find an optimal clustering of the objects such that similar data items
are grouped together and dissimilar data reside in different groups.
Machine learning:
unsupervised learning
clusters = hidden patterns
resulting system = data concept
Subject of active research in statistics, pattern recognition,
machine learning
What makes it hard?
What is an optimal partition? The objective/optimum is
imprecisely defined
How to measure the similarities/dissimilarities? Find an
appropriate metric
Solution space for supervised clustering:
S(n, k) = (1/k!) · Σ_{i=1}^{k} (-1)^(k-i) · C(k, i) · i^n  (Stirling number of the second kind)
e.g. S(25, 5) ≈ 2.4 · 10^15
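The size of this search space can be checked with a short program (a minimal sketch; the function name `stirling2` is ours):

```python
from math import comb, factorial

def stirling2(n, k):
    # Number of ways to partition n labeled objects into k non-empty,
    # unordered clusters (inclusion-exclusion formula above).
    total = sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(1, k + 1))
    return total // factorial(k)

print(stirling2(25, 5))  # on the order of 2.4e15 partitions for only 25 objects
```

Even for 25 objects and 5 clusters, exhaustive search over all partitions is hopeless.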
Rough
Fuzzy
each data item has membership degrees p1, p2, ..., pk with
0 ≤ pi ≤ 1 and p1 + p2 + ... + pk = 1
Algorithms
Partitioning Methods
Relocation Algorithms
K-Means Method
K-Medoids Methods
Probabilistic Clustering
Density-Based Algorithms
Hierarchical methods
Agglomerative Algorithms
Divisive Algorithms
Spectral clustering
Grid-Based Methods
Methods based on Co-Occurrence of Categorical Data
Constraint Based Clustering
Soft Computing: ANN, Evolutionary Methods
K-Means
(Forgy 1965, MacQueen 1967)
Requires the number of clusters
Local search -> local optimum
Dependent on initialization:
Farthest first
Cluster optimally a small random sample
Appropriate for convex clusters of equal volumes
Time complexity: O(tkn)
t iterations
k clusters
n data items
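The two alternating steps of k-means can be sketched in plain Python (an illustrative implementation with Forgy initialization, not the original Forgy/MacQueen code; all names are ours):

```python
import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means with Forgy initialization; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Forgy: k random data items
    labels = [0] * len(points)
    for _ in range(iters):  # t iterations -> O(t*k*n) overall
        # Assignment: each point joins the nearest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Update: move each centroid to the mean of its members.
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new.append(tuple(sum(xs) / len(members) for xs in zip(*members))
                       if members else centroids[j])
        if new == centroids:  # converged to a local optimum
            break
        centroids = new
    return centroids, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

On these two well-separated blobs any initialization recovers the natural grouping; on harder data the local optimum depends on the seed, which is exactly the initialization sensitivity noted above.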
Hierarchical methods
(Diagram: dendrogram; AGGLOMERATIVE methods merge clusters bottom-up, DIVISIVE methods split top-down.)
Agglomerative
Hierarchical methods
Single linkage (Sibson 1973)
May degenerate into chains (chaining effect), with dissimilar points at the ends
Complete linkage (Defays 1977)
Average linkage (Voorhees 1986)
Centroid linkage
Minimum-Variance linkage (Ward 1963)
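The first three linkages follow directly from their definitions; a naive agglomerative loop (quadratic pair scan per merge, for illustration only; function names are ours) can use any of them:

```python
def linkage_dist(A, B, d, kind="single"):
    # Distance between clusters A and B under the chosen linkage.
    pair_dists = [d(a, b) for a in A for b in B]
    if kind == "single":    # nearest pair of points
        return min(pair_dists)
    if kind == "complete":  # farthest pair of points
        return max(pair_dists)
    if kind == "average":   # mean over all point pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(kind)

def agglomerate(points, k, d, kind="single"):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters until only k remain."""
    clusters = [[p] for p in points]  # start: each point is its own cluster
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage_dist(
            clusters[ij[0]], clusters[ij[1]], d, kind))
        clusters[i] += clusters.pop(j)
    return clusters

clusters = agglomerate([0.0, 1.0, 2.0, 10.0, 11.0, 12.0], 2,
                       d=lambda a, b: abs(a - b), kind="single")
```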
Divisive
Hierarchical methods
Very few approaches
What is the size of the search space in the first iteration? (2^(n-1) - 1 ways to split n items into two non-empty groups)
DIANA
Select the cluster with the largest diameter (the largest
dissimilarity between any two of its observations)
Split the selected cluster into two new clusters:
The most disparate observation (which has the largest average
dissimilarity) initiates the splinter group
The points are reassigned based on the distance between the splinter
group and the old group
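The splitting step can be sketched as follows (a simplified illustration assuming distinct data items; `diana_split` is our name, not from the DIANA sources):

```python
def diana_split(cluster, d):
    """One DIANA division: the most disparate observation seeds a splinter
    group, then points defect while they are, on average, closer to the
    splinter than to the rest of the old group."""
    def avg(p, group):
        return sum(d(p, q) for q in group) / len(group)

    # Seed: largest average dissimilarity to the other observations.
    seed = max(cluster, key=lambda p: avg(p, [q for q in cluster if q != p]))
    splinter = [seed]
    old = [p for p in cluster if p != seed]
    moved = True
    while moved and len(old) > 1:
        moved = False
        for p in list(old):
            # Defect if, on average, closer to the splinter group.
            if avg(p, splinter) < avg(p, [q for q in old if q != p]):
                old.remove(p)
                splinter.append(p)
                moved = True
    return splinter, old

splinter, old = diana_split([0.0, 1.0, 2.0, 10.0, 11.0, 12.0],
                            lambda a, b: abs(a - b))
```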
DIANA - Example
(Plot: average distance to centroid vs. number of clusters k; the best value of k lies at the elbow of the curve.)
K-Means estimating k
Too few clusters: many long distances to centroid. (scatter-plot illustration)
K-Means estimating k
Just right: distances rather short. (scatter-plot illustration)
K-Means estimating k
Too many clusters: little improvement in average distance. (scatter-plot illustration)
Estimating k in HC
The Elbow/Knee method
Estimating k
The Elbow/Knee method
Take the differences of successive index values; choose the largest difference (or the largest ratio) between two successive points.
The first data point with a second derivative above some
threshold value.
The data point with the largest second derivative.
The point on the curve that is furthest from a line fitted to
the entire curve.
Find the boundary between the pair of straight lines that
most closely fit the curve.
Stan Salvador, Philip Chan ICTAI 2004
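The "largest second derivative" heuristic from the list above is easy to sketch (an illustrative helper, not code from the cited paper; assumes the curve is evaluated at k = 1, 2, 3, ...):

```python
def knee_by_second_derivative(values):
    # values[i] is the evaluation curve (e.g. average distance to centroid)
    # at k = i + 1; the knee is where the improvement flattens fastest,
    # i.e. where the discrete second derivative is largest.
    second = [values[i - 1] - 2 * values[i] + values[i + 1]
              for i in range(1, len(values) - 1)]
    return second.index(max(second)) + 2  # +2 maps the list index back to k

best_k = knee_by_second_derivative([10.0, 6.0, 3.0, 2.5, 2.2, 2.0])
```

Here the curve drops steeply up to k = 3 and then flattens, so the heuristic picks 3.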
Estimating k
Information criteria
σ̂i² is the maximum likelihood estimate of the variance of the i-th cluster
under the identical spherical Gaussian assumption
Estimating k
BIC
Strategy: first local maximum indicates strong evidence for
the model size
Not always true
(Luchian, 1999)
Other criteria
Silhouette Width
(Rousseuw,1987)
Davies-Bouldin Index
(Davies, Bouldin, 1979)
Multi-objective optimization
(Handl, Knowles, 2005)
Unsupervised clustering criteria
Evolutionary algorithms for
clustering
Representation
(variable length) string of cluster centres (Luchian, 1994)
n-length chromosome; allele values from 1 to n; c[i]=j means i
and j are in the same cluster (Handl & Knowles, 2004)
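Decoding the second representation amounts to taking connected components of the links c[i] = j; a union-find sketch (our code, using 0-based indices where the slide uses 1..n):

```python
def decode(chromosome):
    """Turn a linkage chromosome into cluster labels: chromosome[i] = j
    asserts items i and j share a cluster; clusters are the connected
    components of these links (union-find)."""
    n = len(chromosome)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in enumerate(chromosome):
        parent[find(i)] = find(j)  # union the two linked items

    roots = [find(i) for i in range(n)]
    ids = {r: c for c, r in enumerate(dict.fromkeys(roots))}  # consecutive ids
    return [ids[r] for r in roots]

labels = decode([1, 0, 3, 2, 2])  # links: 0-1, 2-3, 4-2
```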
Evaluation
Supervised or unsupervised clustering criteria
Performance:
Same as k-Means if the latter is given the best possible
initialization
With multi-objective GAs the performance is significantly better
Subspace clustering
In high dimensional data many dimensions are often
irrelevant
Find clusters in different subspaces within a dataset
Algorithms:
Bottom-up
CLIQUE
ENCLUS
MAFIA
Top-down
PROCLUS
COSA
(for a survey and comparative study: Parsons, Haque, Liu,
"Evaluating Subspace Clustering Algorithms", 2004)
Ensemble clustering
Goals
Reach consensus
Create more robust and stable solutions
Research
Ensemble construction
Bagging
Boosting
Random feature subspaces
Different clustering algorithms
Procedures to combine individual solutions
Ensemble clustering
(Diagram) Each ensemble member Ci (i = 1..m) partitions the n data items; aggregation combines the members into an n×n similarity matrix S, where Sij reflects how often items i and j are co-assigned across the ensemble; graph partitioning of S (e.g. with METIS*) produces the consensus clustering C* with k clusters.
Ci - member of ensemble
n - # data items
ki - # clusters for Ci
* METIS - http://glaros.dtc.umn.edu/gkhome/
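The aggregation step can be sketched with a co-association matrix; for simplicity this illustration replaces the graph-partitioning step (METIS on the slide) with connected components of the thresholded matrix, and all names are ours:

```python
def co_association(labelings):
    # S[p][q] = fraction of ensemble members that put items p and q
    # into the same cluster.
    m, n = len(labelings), len(labelings[0])
    return [[sum(lab[p] == lab[q] for lab in labelings) / m
             for q in range(n)] for p in range(n)]

def consensus(labelings, threshold=0.5):
    """Consensus partition: connected components of the thresholded
    co-association matrix (a simple stand-in for graph partitioning)."""
    S = co_association(labelings)
    n = len(S)
    labels, next_id = [None] * n, 0
    for p in range(n):
        if labels[p] is None:
            labels[p] = next_id
            stack = [p]
            while stack:  # flood-fill one component
                u = stack.pop()
                for v in range(n):
                    if S[u][v] > threshold and labels[v] is None:
                        labels[v] = next_id
                        stack.append(v)
            next_id += 1
    return labels

# Three ensemble members cluster four items; one member disagrees on item 1,
# but the majority co-assignment wins.
labels = consensus([[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]])
```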