
Data Mining: CLUSTERING

MOC
FII 2010
The problem
Split a set of objects into natural groups based on
similarities/dissimilarities
Given
a set of a attributes
n objects, each described by a attribute values
find an optimal clustering of the objects such that similar data items
are grouped together and dissimilar items reside in different groups.
Machine learning:
unsupervised learning
clusters = hidden patterns
resulting system = data concept
Subject of active research in statistics, pattern recognition,
machine learning
What makes it hard?
What is an optimal partition? The objective/optimum is
imprecisely defined
How to measure the similarities/dissimilarities? Find an
appropriate metric
Solution space for supervised clustering (k given): the Stirling number of the second kind
$$S(n,k) = \frac{1}{k!}\sum_{i=1}^{k}(-1)^{k-i}\binom{k}{i}\, i^{n}, \qquad S(25,5) \approx 2\cdot 10^{15}$$

Solution space for unsupervised clustering (k unknown): the Bell number
$$B(n) = \frac{1}{e}\sum_{k=1}^{\infty}\frac{k^{n}}{k!}, \qquad B(25) \approx 4\cdot 10^{18}$$
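For a sense of these magnitudes, here is a small Python sketch (function names are my own) that evaluates both formulas exactly with integer arithmetic:

```python
from math import comb, factorial

def stirling2(n, k):
    # Number of partitions of n items into exactly k non-empty clusters
    return sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(k + 1)) // factorial(k)

def bell(n):
    # Number of partitions of n items into any number of clusters
    return sum(stirling2(n, k) for k in range(n + 1))

print(f"{stirling2(25, 5):.2e}")  # ~2.4e+15 partitions of 25 items into 5 clusters
print(f"{bell(25):.2e}")          # ~4.6e+18 partitions of 25 items overall
```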
Approaches
Crisp

Rough

Fuzzy

each data item receives membership degrees p1, p2, ..., pk, with 0 ≤ pi ≤ 1 and p1 + p2 + ... + pk = 1
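As an illustration of the fuzzy case (the membership matrix below is made up), each item carries one membership degree per cluster, and a crisp partition is recovered by keeping only the largest one:

```python
import numpy as np

# Hypothetical fuzzy membership matrix: 4 data items x 3 clusters.
# Each row holds p1..pk with 0 <= pi <= 1 and sum(p) = 1.
P = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.33, 0.33, 0.34],
    [0.05, 0.05, 0.90],
])
assert np.allclose(P.sum(axis=1), 1.0) and (P >= 0).all() and (P <= 1).all()

# A crisp partition is the special case of 0/1 rows; "hardening" a fuzzy
# partition assigns each item to its most likely cluster.
crisp_labels = P.argmax(axis=1)
print(crisp_labels)  # [0 1 2 2]
```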
Algorithms
Partitioning Methods
Relocation Algorithms
K-Means Method
K-Medoids Methods
Probabilistic Clustering
Density-Based Algorithms

Hierarchical methods
Agglomerative Algorithms
Divisive Algorithms

Spectral clustering
Grid-Based Methods
Methods based on Co-Occurrence of Categorical Data
Constraint Based Clustering
Soft Computing: ANN, Evolutionary Methods
K-Means
(Forgy 1965, MacQueen 1967)
Requires the number of clusters
Local search -> local optimum
Dependent on initialization:
Farthest first
Cluster optimally a small random sample
Appropriate for convex clusters of equal volumes
Time complexity: O(tkn)
t iterations
k clusters
n data items
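A minimal sketch of the Lloyd/Forgy iteration described above, assuming Euclidean distance and plain random initialization (not one of the smarter initializations listed); function name and synthetic data are my own:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and centroid
    update until the centroids stop moving (a local optimum)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):  # t iterations, each costing O(k*n) distance evaluations
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.7, (50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = k_means(X, k=3)
```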
Hierarchical methods

[Diagram: agglomerative (bottom-up) vs. divisive (top-down) construction of the cluster hierarchy]
Agglomerative
Hierarchical methods
Single linkage (Sibson 1973)
May degenerate into long chains, with rather dissimilar points at the two ends
Complete linkage (Defays 1977)
Average linkage (Voorhees 1986)
Centroid linkage
Minimum-Variance linkage (Ward 1963)
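For illustration, all the linkage criteria above are available in SciPy; a small sketch on synthetic data (the data and the cut into three clusters are my own choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in ([0, 0], [4, 0], [2, 4])])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                     # builds the dendrogram bottom-up
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut it into 3 clusters
    print(method, np.bincount(labels)[1:])            # cluster sizes per linkage
```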
Divisive
Hierarchical methods
Very few approaches
What is the size of the search space in the first iteration?

DIANA
Select the cluster with the largest diameter (the largest
dissimilarity between any two of its observations)
Split the selected cluster into two new clusters:
The most disparate observation (which has the largest average
dissimilarity) initiates the splinter group
The points are reassigned based on the distance between the splinter
group and the old group
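A rough sketch of one such split, assuming a precomputed dissimilarity matrix `D` and 0-based indices (the helper name and the stopping details are my own):

```python
import numpy as np

def diana_split(D, members):
    """Split one cluster the DIANA way: the most disparate observation seeds a
    splinter group, then points closer (on average) to the splinter group move over."""
    old = list(members)
    avg_diss = [np.mean([D[i, j] for j in old if j != i]) for i in old]
    splinter = [old.pop(int(np.argmax(avg_diss)))]    # most disparate observation
    moved = True
    while moved and len(old) > 1:
        moved = False
        for i in list(old):
            if len(old) == 1:
                break
            d_splinter = np.mean([D[i, j] for j in splinter])
            d_old = np.mean([D[i, j] for j in old if j != i])
            if d_splinter < d_old:                    # closer to the splinter group
                old.remove(i)
                splinter.append(i)
                moved = True
    return splinter, old
```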
DIANA - Example
Use the average distance.

DIANA - Example
The distance between clusters is given by the maximum distance between entities in the two clusters.
Hierarchical methods
Space complexity (distance matrix): O(n²)
Space complexity (dendrogram): O(kn)
Time complexity: O(n³) -> O(n² log n) -> O(n²)

Impossible to reallocate points


No objective function directly minimized
Not incremental
Unsupervised clustering
How many clusters in data?
Traditional methods need to know k in advance
There might be no definite or unique answer
Methods to estimate k:
1. The elbow method
2. Information scores
3. Unsupervised clustering criteria
Estimating k in k-Means
The Elbow/Knee method
Try different values of k and plot the average distance to centroid as k increases.
The average falls rapidly until the right k is reached, then changes little.

Best value
Average of k
distance to
centroid

k
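A sketch of the procedure with scikit-learn's KMeans on synthetic data (the data and the range of k are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.8, (100, 2)) for c in ([0, 0], [6, 0], [3, 5])])

ks = range(1, 10)
avg_dist = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # average distance of each point to its assigned centroid
    avg_dist.append(np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1).mean())

plt.plot(list(ks), avg_dist, marker="o")
plt.xlabel("k")
plt.ylabel("average distance to centroid")
plt.show()  # the curve drops sharply until the right k (here 3), then flattens
```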
K-Means estimating k
Too few clusters: many long distances to centroid.
[Scatter plot of the data]
K-Means estimating k
Just right: distances rather short.
[Scatter plot of the same data]
K-Means estimating k
Too many clusters: little improvement in average distance.
[Scatter plot of the same data]
Estimating k in HC
The Elbow/Knee method
Estimating k
The Elbow/Knee method
The largest magnitude difference between two successive values.
The largest ratio difference between two successive values.
The first data point with a second derivative above some threshold value.
The data point with the largest second derivative.
The point on the curve that is furthest from a line fitted to the entire curve.
The boundary between the pair of straight lines that most closely fit the curve (the L-method).
(Salvador & Chan, ICTAI 2004)
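As an illustration, a sketch of the "furthest from a line" heuristic above, using the straight line through the curve's endpoints as a common simplification of a line fitted to the entire curve (the function name and toy values are my own):

```python
import numpy as np

def knee_point(ks, values):
    """Return the k whose point on the evaluation curve lies furthest from the
    straight line joining the curve's two endpoints."""
    ks = np.asarray(ks, dtype=float)
    values = np.asarray(values, dtype=float)
    p1 = np.array([ks[0], values[0]])
    p2 = np.array([ks[-1], values[-1]])
    direction = (p2 - p1) / np.linalg.norm(p2 - p1)
    rel = np.column_stack([ks, values]) - p1
    # perpendicular distance = length of the component orthogonal to the chord
    dist = np.abs(rel[:, 0] * direction[1] - rel[:, 1] * direction[0])
    return ks[int(np.argmax(dist))]

print(knee_point([1, 2, 3, 4, 5, 6], [10.0, 4.0, 2.0, 1.8, 1.7, 1.6]))  # -> 3.0
```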
Estimating k
Information criteria

Bayesian Information Criterion

Akaike Information Criterion


Estimating k
BIC
$$BIC = L(\hat{\theta}) - \frac{1}{2}\,k\log n = \sum_{i=1}^{k}\left(n_i\log n_i - n_i\log n - \frac{n_i d}{2}\log(2\pi) - \frac{n_i}{2}\log\Sigma_i - \frac{n_i-k}{2}\right) - \frac{1}{2}\,k\log n$$
$$\Sigma_i = \frac{1}{n_i-k}\sum_{j=1}^{n_i}\lVert x_j - C_i\rVert^{2}$$

$L(\hat{\theta})$ is the log-likelihood of the data according to the model
$\Sigma_i$ is the maximum likelihood estimate for the variance of the i-th cluster
under the identical spherical Gaussian assumption
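A direct transcription of the formula above into Python (the function name is my own; the floor on the variance is only there to avoid log(0) on degenerate clusters):

```python
import numpy as np

def clustering_bic(X, labels, centroids):
    """BIC of a partition under the identical spherical Gaussian assumption."""
    n, d = X.shape
    k = len(centroids)
    bic = -0.5 * k * np.log(n)
    for i in range(k):
        Xi = X[labels == i]
        ni = len(Xi)
        # maximum likelihood estimate of the variance of cluster i
        sigma_i = max(np.sum((Xi - centroids[i]) ** 2) / max(ni - k, 1), 1e-12)
        bic += (ni * np.log(ni) - ni * np.log(n)
                - 0.5 * ni * d * np.log(2 * np.pi)
                - 0.5 * ni * np.log(sigma_i)
                - 0.5 * (ni - k))
    return bic
```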
Estimating k
BIC
Strategy: the first local maximum indicates strong evidence for the model size
Not always true
(Zhao, Hautamäki, Fränti, 2008)


Estimating k

The main drawback of the previous methods:
Given two partitions of a data set with k1 and k2 clusters, which one is the best?
To answer this question, a criterion able to offer an ordering of all partitions irrespective of the number of clusters is required.
Clustering = Optimization
What about the criterion?
The quality of a partition:
Compactness
maximize the similarity between the elements of the same cluster (minimize the intra-cluster distance)
Separability/Isolation
minimize the similarity between elements belonging to different clusters (maximize the inter-cluster distance)
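One simple way to put numbers on the two notions (the function name and the particular distance choices are mine; many variants exist):

```python
import numpy as np

def compactness_and_separation(X, labels, centroids):
    """Mean distance to the assigned centroid (compactness, smaller is better)
    and smallest distance between two centroids (separation, larger is better)."""
    intra = np.linalg.norm(X - centroids[labels], axis=1).mean()
    pairwise = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    inter = pairwise[np.triu_indices(len(centroids), k=1)].min()
    return intra, inter
```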
Clustering = Optimization
What about the criterion?
Minimizing the intra-cluster distance and maximizing the isolation are appropriate objectives for supervised clustering
In the unsupervised case these lead, in the extreme, to n singleton clusters (n = the number of data items)
Unsupervised clustering criteria
Penalize increasing number of clusters
(Luchian, 1994)

(Luchian, 1999)

Other criteria
Silhouette Width
(Rousseeuw, 1987)

Davies-Bouldin Index
(Davies, Bouldin, 1979)

Multi-objective optimization
(Handl, Knowles, 2005)
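Both the Silhouette Width and the Davies-Bouldin Index are available in scikit-learn, so comparing partitions with different k takes only a few lines (synthetic data and the range of k are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.7, (60, 2)) for c in ([0, 0], [5, 0], [2, 4])])

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette: higher is better; Davies-Bouldin: lower is better.
    print(k, round(silhouette_score(X, labels), 3),
             round(davies_bouldin_score(X, labels), 3))
```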
Unsupervised clustering criteria

(M.Breaban & H. Luchian, GECCO 2009)

Synthetic data sets


J.Handl and J. Knowles:
http://dbkgroup.org/handl/generators/

Evolutionary algorithms for
clustering
Representation
(variable length) string of cluster centres (Luchian, 1994)
n-length chromosome; allele values from 1 to n; c[i]=j means i
and j are in the same cluster (Handl & Knowles, 2004)
Evaluation
Supervised or unsupervised clustering criteria
Performance:
Same as k-Means if the latter is given the best possible
initialization
In the case of multi-objective GAs the performance is significantly better
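A sketch of how the second representation above (Handl & Knowles, 2004) can be decoded: the clusters are the connected components of the graph whose edges are (i, c[i]). Indices here are 0-based, whereas the slide's alleles run from 1 to n; the function name is my own:

```python
def decode_chromosome(c):
    """Decode an n-length chromosome where gene c[i] = j links items i and j;
    clusters are the connected components of the resulting graph."""
    n = len(c)
    labels = [-1] * n
    next_cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # follow links until we meet an already-labelled item or close a cycle
        path, i = [], start
        while labels[i] == -1 and i not in path:
            path.append(i)
            i = c[i]
        if labels[i] != -1:
            lab = labels[i]
        else:
            lab = next_cluster
            next_cluster += 1
        for j in path:
            labels[j] = lab
    return labels

print(decode_chromosome([1, 0, 3, 2, 3]))  # -> [0, 0, 1, 1, 1]: two clusters
```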
Subspace clustering
In high dimensional data many dimensions are often
irrelevant
Find clusters in different subspaces within a dataset
Algorithms:
Bottom-up
CLIQUE
ENCLUS
MAFIA
Top-down
PROCLUS
COSA
(for a survey and comparative study: Parsons, Haque, Liu, "Evaluating Subspace Clustering Algorithms", 2004)
Ensemble clustering
Goals
Reach consensus
Create more robust and stable solutions
Research
Ensemble construction
Bagging
Boosting
Random feature subspaces
Different clustering algorithms
Procedures to combine individual solutions
Ensemble clustering

[Diagram: the m ensemble members C1, C2, ..., Cm (each a partition, encoded e.g. as a binary membership matrix) are aggregated into an n x n similarity matrix S, with Sij = cos(Ci^p, Cj^q), p, q = 1..m; graph partitioning* of S then yields the consensus clustering C with k clusters]
Ci - member of the ensemble
n - # data items
ki - # clusters of Ci
* METIS - http://glaros.dtc.umn.edu/gkhome/
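A sketch of this pipeline using a co-association matrix as the aggregation step; the scheme above uses METIS graph partitioning for the final step, while here an average-linkage cut serves as a stand-in (the data, ensemble size and target of 3 clusters are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.6, size=(40, 2)) for c in ([0, 0], [5, 0], [2, 4])])

# Ensemble of m k-means runs with varying k; S[i, j] = fraction of runs that
# put items i and j in the same cluster (the co-association matrix).
n, m = len(X), 10
S = np.zeros((n, n))
for run in range(m):
    labels = KMeans(n_clusters=int(rng.integers(2, 6)), n_init=5,
                    random_state=run).fit_predict(X)
    S += (labels[:, None] == labels[None, :])
S /= m

# Turn similarities into distances and extract the consensus partition.
Z = linkage(squareform(1.0 - S, checks=False), method="average")
consensus = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(consensus)[1:])  # sizes of the consensus clusters
```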
