
Data Mining: CLUSTERING

MOC
FII 2010
The problem
Split a set of objects into natural groups based on
similarities/dissimilarities
Given
a set of a attributes
n objects, each described by a attribute values
find an optimal clustering of the objects such that similar data items
are grouped together and dissimilar items reside in different groups.
Machine learning:
unsupervised learning
clusters = hidden patterns
resulting system = data concept
Subject of active research in statistics, pattern recognition,
machine learning
What makes it hard?
What is an optimal partition? The objective/optimum is
imprecisely defined
How to measure the similarities/dissimilarities? Find an
appropriate metric
Solution space for supervised clustering (k given): the Stirling number of the second kind
$$S(n,k) = \frac{1}{k!}\sum_{i=1}^{k}(-1)^{k-i}\binom{k}{i}\, i^{n}, \qquad S(25,5) \approx 2\cdot 10^{15}$$

Solution space for unsupervised clustering (k unknown): the Bell number
$$B(n) = \frac{1}{e}\sum_{k=1}^{\infty}\frac{k^{n}}{k!}, \qquad B(25) \approx 4\cdot 10^{18}$$
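For a sense of these magnitudes, here is a small Python sketch (function names are my own) that evaluates both formulas exactly with integer arithmetic:

```python
from math import comb, factorial

def stirling2(n, k):
    # Number of partitions of n items into exactly k non-empty clusters
    return sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(k + 1)) // factorial(k)

def bell(n):
    # Number of partitions of n items into any number of clusters
    return sum(stirling2(n, k) for k in range(n + 1))

print(f"{stirling2(25, 5):.2e}")  # ~2.4e+15 partitions of 25 items into 5 clusters
print(f"{bell(25):.2e}")          # ~4.6e+18 partitions of 25 items overall
```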
Approaches
Crisp

Rough

Fuzzy

each data item receives membership degrees p1, p2, ..., pk, with 0 ≤ pi ≤ 1 and p1 + p2 + ... + pk = 1
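As an illustration of the fuzzy case (the membership matrix below is made up), each item carries one membership degree per cluster, and a crisp partition is recovered by keeping only the largest one:

```python
import numpy as np

# Hypothetical fuzzy membership matrix: 4 data items x 3 clusters.
# Each row holds p1..pk with 0 <= pi <= 1 and sum(p) = 1.
P = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.33, 0.33, 0.34],
    [0.05, 0.05, 0.90],
])
assert np.allclose(P.sum(axis=1), 1.0) and (P >= 0).all() and (P <= 1).all()

# A crisp partition is the special case of 0/1 rows; "hardening" a fuzzy
# partition assigns each item to its most likely cluster.
crisp_labels = P.argmax(axis=1)
print(crisp_labels)  # [0 1 2 2]
```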
Algorithms
Partitioning Methods
Relocation Algorithms
K-Means Method
K-Medoids Methods
Probabilistic Clustering
Density-Based Algorithms

Hierarchical methods
Agglomerative Algorithms
Divisive Algorithms

Spectral clustering
Grid-Based Methods
Methods based on Co-Occurrence of Categorical Data
Constraint Based Clustering
Soft Computing: ANN, Evolutionary Methods
K-Means
(Forgy 1965, MacQueen 1967)
Requires the number of clusters
Local search -> local optimum
Dependent on initialization:
Farthest first
Cluster optimally a small random sample
Appropriate for convex clusters of equal volumes
Time complexity: O(tkn)
t iterations
k clusters
n data items
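A minimal sketch of the Lloyd/Forgy iteration described above, assuming Euclidean distance and plain random initialization (not one of the smarter initializations listed); function name and synthetic data are my own:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and centroid
    update until the centroids stop moving (a local optimum)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):  # t iterations, each costing O(k*n) distance evaluations
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.7, (50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = k_means(X, k=3)
```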
Hierarchical methods

[Diagram: agglomerative (bottom-up) vs. divisive (top-down) construction of the cluster hierarchy]
Agglomerative
Hierarchical methods
Single linkage (Sibson 1973)
May degenerate into long chains, with rather dissimilar points at the two ends
Complete linkage (Defays 1977)
Average linkage (Voorhees 1986)
Centroid linkage
Minimum-Variance linkage (Ward 1963)
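For illustration, all the linkage criteria above are available in SciPy; a small sketch on synthetic data (the data and the cut into three clusters are my own choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in ([0, 0], [4, 0], [2, 4])])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                     # builds the dendrogram bottom-up
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut it into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes per linkage
```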
Divisive
Hierarchical methods
Very few approaches
What is the size of the search space in the first iteration?

DIANA
Select the cluster with the largest diameter (the largest
dissimilarity between any two of its observations)
Split the selected cluster into two new clusters:
The most disparate observation (which has the largest average
dissimilarity) initiates the splinter group
The points are reassigned based on the distance between the splinter
group and the old group
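A rough sketch of one such split, assuming a precomputed dissimilarity matrix `D` and 0-based indices (the helper name and the stopping details are my own):

```python
import numpy as np

def diana_split(D, members):
    """Split one cluster the DIANA way: the most disparate observation seeds a
    splinter group, then points closer (on average) to the splinter group move over."""
    old = list(members)
    avg_diss = [np.mean([D[i, j] for j in old if j != i]) for i in old]
    splinter = [old.pop(int(np.argmax(avg_diss)))]    # most disparate observation
    moved = True
    while moved and len(old) > 1:
        moved = False
        for i in list(old):
            if len(old) == 1:
                break
            d_splinter = np.mean([D[i, j] for j in splinter])
            d_old = np.mean([D[i, j] for j in old if j != i])
            if d_splinter < d_old:                    # closer to the splinter group
                old.remove(i)
                splinter.append(i)
                moved = True
    return splinter, old
```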
DIANA - Example
Use the average distance.

DIANA - Example
The distance between clusters is given by the maximum distance between entities in the two clusters.
Hierarchical methods
Space complexity (distance matrix): O(n²)
Space complexity (dendrogram): O(kn)
Time complexity: O(n³) -> O(n² log n) -> O(n²)

Impossible to reallocate points


No objective function directly minimized
Not incremental
Unsupervised clustering
How many clusters in data?
Traditional methods need to know k in advance
There might be no definite or unique answer
Methods to estimate k:
1. The elbow method
2. Information scores
3. Unsupervised clustering criteria
Estimating k in k-Means
The Elbow/Knee method
Try different values of k and plot the average distance to centroid as k increases.
The average falls rapidly until the right k is reached, then changes little.

Best value
Average of k
distance to
centroid

k
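A sketch of the procedure with scikit-learn's KMeans on synthetic data (the data and the range of k are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.8, (100, 2)) for c in ([0, 0], [6, 0], [3, 5])])

ks = range(1, 10)
avg_dist = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # average distance of each point to its assigned centroid
    avg_dist.append(np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1).mean())

plt.plot(list(ks), avg_dist, marker="o")
plt.xlabel("k")
plt.ylabel("average distance to centroid")
plt.show()  # the curve drops sharply until the right k (here 3), then flattens
```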
K-Means estimating k
Too few clusters: many long distances to centroid.
[Scatter plot of the data]
K-Means estimating k
Just right: distances rather short.
[Scatter plot of the same data]
K-Means estimating k
Too many clusters: little improvement in average distance.
[Scatter plot of the same data]
Estimating k in HC
The Elbow/Knee method
Estimating k
The Elbow/Knee method
The largest magnitude difference between two successive values.
The largest ratio difference between two successive values.
The first data point with a second derivative above some threshold value.
The data point with the largest second derivative.
The point on the curve that is furthest from a line fitted to the entire curve.
The boundary between the pair of straight lines that most closely fit the curve (the L-method).
(Salvador & Chan, ICTAI 2004)
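As an illustration, a sketch of the "furthest from a line" heuristic above, using the straight line through the curve's endpoints as a common simplification of a line fitted to the entire curve (the function name and toy values are my own):

```python
import numpy as np

def knee_point(ks, values):
    """Return the k whose point on the evaluation curve lies furthest from the
    straight line joining the curve's two endpoints."""
    ks = np.asarray(ks, dtype=float)
    values = np.asarray(values, dtype=float)
    p1 = np.array([ks[0], values[0]])
    p2 = np.array([ks[-1], values[-1]])
    direction = (p2 - p1) / np.linalg.norm(p2 - p1)
    rel = np.column_stack([ks, values]) - p1
    # perpendicular distance = length of the component orthogonal to the chord
    dist = np.abs(rel[:, 0] * direction[1] - rel[:, 1] * direction[0])
    return ks[int(np.argmax(dist))]

print(knee_point([1, 2, 3, 4, 5, 6], [10.0, 4.0, 2.0, 1.8, 1.7, 1.6]))  # -> 3.0
```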
Estimating k
Information criteria

Bayesian Information Criterion

Akaike Information Criterion


Estimating k
BIC
$$BIC = L(\hat{\theta}) - \frac{1}{2}\,k\log n = \sum_{i=1}^{k}\left(n_i\log n_i - n_i\log n - \frac{n_i d}{2}\log(2\pi) - \frac{n_i}{2}\log\Sigma_i - \frac{n_i-k}{2}\right) - \frac{1}{2}\,k\log n$$
$$\Sigma_i = \frac{1}{n_i-k}\sum_{j=1}^{n_i}\lVert x_j - C_i\rVert^{2}$$

$L(\hat{\theta})$ is the log-likelihood of the data according to the model
$\Sigma_i$ is the maximum likelihood estimate for the variance of the i-th cluster
under the identical spherical Gaussian assumption
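A direct transcription of the formula above into Python (the function name is my own; the floor on the variance is only there to avoid log(0) on degenerate clusters):

```python
import numpy as np

def clustering_bic(X, labels, centroids):
    """BIC of a partition under the identical spherical Gaussian assumption."""
    n, d = X.shape
    k = len(centroids)
    bic = -0.5 * k * np.log(n)
    for i in range(k):
        Xi = X[labels == i]
        ni = len(Xi)
        # maximum likelihood estimate of the variance of cluster i
        sigma_i = max(np.sum((Xi - centroids[i]) ** 2) / max(ni - k, 1), 1e-12)
        bic += (ni * np.log(ni) - ni * np.log(n)
                - 0.5 * ni * d * np.log(2 * np.pi)
                - 0.5 * ni * np.log(sigma_i)
                - 0.5 * (ni - k))
    return bic
```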
Estimating k
BIC
Strategy: the first local maximum indicates strong evidence for the model size
Not always true
(Zhao, Hautamäki, Fränti, 2008)


Estimating k

The main drawback of the previous methods:
Given two partitions of a data set with k1 and k2 clusters, which one is the best?
To answer this question, a criterion able to offer an ordering of all partitions irrespective of the number of clusters is required.
Clustering = Optimization
What about the criterion?
The quality of a partition:
Compactness
maximize the similarity between the elements of the same cluster (minimize the intra-cluster distance)
Separability/Isolation
minimize the similarity between elements belonging to different clusters (maximize the inter-cluster distance)
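One simple way to put numbers on the two notions (the function name and the particular distance choices are mine; many variants exist):

```python
import numpy as np

def compactness_and_separation(X, labels, centroids):
    """Mean distance to the assigned centroid (compactness, smaller is better)
    and smallest distance between two centroids (separation, larger is better)."""
    intra = np.linalg.norm(X - centroids[labels], axis=1).mean()
    pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    inter = pairwise[np.triu_indices(len(centroids), k=1)].min()
    return intra, inter
```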
Clustering = Optimization
What about the criterion?
Minimizing the intra-cluster distance and maximizing the isolation are appropriate objectives for supervised clustering
In the unsupervised case these lead, in the extreme, to n singleton clusters (n = the number of data items)
Unsupervised clustering criteria
Penalize increasing number of clusters
(Luchian, 1994)

(Luchian, 1999)

Other criteria
Silhouette Width
(Rousseeuw, 1987)

Davies-Bouldin Index
(Davies, Bouldin, 1979)

Multi-objective optimization
(Handl, Knowles, 2005)
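Both the Silhouette Width and the Davies-Bouldin Index are available in scikit-learn, so comparing partitions with different k takes only a few lines (synthetic data and the range of k are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.7, (60, 2)) for c in ([0, 0], [5, 0], [2, 4])])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette: higher is better; Davies-Bouldin: lower is better.
    print(k, round(silhouette_score(X, labels), 3),
             round(davies_bouldin_score(X, labels), 3))
```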
Unsupervised clustering criteria

(M.Breaban & H. Luchian, GECCO 2009)

Synthetic data sets


J.Handl and J. Knowles:
http://dbkgroup.org/handl/generators/

Evolutionary algorithms for
clustering
Representation
(variable length) string of cluster centres (Luchian, 1994)
n-length chromosome; allele values from 1 to n; c[i]=j means i
and j are in the same cluster (Handl & Knowles, 2004)
Evaluation
Supervised or unsupervised clustering criteria
Performance:
Same as k-Means if the latter is given the best possible
initialization
In the case of multi-objective GAs the performance is significantly better
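A sketch of how the second representation above (Handl & Knowles, 2004) can be decoded: the clusters are the connected components of the graph whose edges are (i, c[i]). Indices here are 0-based, whereas the slide's alleles run from 1 to n; the function name is my own:

```python
def decode_chromosome(c):
    """Decode an n-length chromosome where gene c[i] = j links items i and j;
    clusters are the connected components of the resulting graph."""
    n = len(c)
    labels = [-1] * n
    next_cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # follow links until we meet an already-labelled item or close a cycle
        path, i = [], start
        while labels[i] == -1 and i not in path:
            path.append(i)
            i = c[i]
        if labels[i] != -1:
            lab = labels[i]
        else:
            lab = next_cluster
            next_cluster += 1
        for j in path:
            labels[j] = lab
    return labels

print(decode_chromosome([1, 0, 3, 2, 3]))  # -> [0, 0, 1, 1, 1]: two clusters
```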
Subspace clustering
In high dimensional data many dimensions are often
irrelevant
Find clusters in different subspaces within a dataset
Algorithms:
Bottom-up
CLIQUE
ENCLUS
MAFIA
Top-down
PROCLUS
COSA
(for a survey and comparative study: Parsons, Haque, Liu, "Evaluating Subspace Clustering Algorithms", 2004)
Ensemble clustering
Goals
Reach consensus
Create more robust and stable solutions
Research
Ensemble construction
Bagging
Boosting
Random feature subspaces
Different clustering algorithms
Procedures to combine individual solutions
Ensemble clustering

[Diagram: the m ensemble members C1, C2, ..., Cm (each a partition, encoded e.g. as a binary membership matrix) are aggregated into an n x n similarity matrix S, with Sij = cos(Ci^p, Cj^q), p, q = 1..m; graph partitioning* of S then yields the consensus clustering C with k clusters]
Ci - member of the ensemble
n - # data items
ki - # clusters of Ci
* METIS - http://glaros.dtc.umn.edu/gkhome/
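A sketch of this pipeline using a co-association matrix as the aggregation step; the scheme above uses METIS graph partitioning for the final step, while here an average-linkage cut serves as a stand-in (the data, ensemble size and target of 3 clusters are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, size=(40, 2)) for c in ([0, 0], [5, 0], [2, 4])])

# Ensemble of m k-means runs with varying k; S[i, j] = fraction of runs that
# put items i and j in the same cluster (the co-association matrix).
n, m = len(X), 10
S = np.zeros((n, n))
for run in range(m):
    labels = KMeans(n_clusters=int(rng.integers(2, 6)), n_init=5,
                    random_state=run).fit_predict(X)
    S += (labels[:, None] == labels[None, :])
S /= m

# Turn similarities into distances and extract the consensus partition.
Z = linkage(squareform(1.0 - S, checks=False), method="average")
consensus = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(consensus)[1:])  # sizes of the consensus clusters
```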
