
# Pattern Classification

All materials in these slides were taken from
Pattern Classification (2nd ed) by R. O.
Duda, P. E. Hart and D. G. Stork, John Wiley
& Sons, 2000
with the permission of the authors and the
publisher
Chapter 10
Unsupervised Learning & Clustering

Introduction
Mixture Densities and Identifiability
ML Estimates
Application to Normal Mixtures
K-means algorithm
Unsupervised Bayesian Learning
Data description and clustering
Criterion function for clustering
Hierarchical clustering
The number-of-clusters problem and cluster validation
On-line clustering
Graph-theoretic methods
PCA and ICA
Low-dimensional representations and multidimensional scaling (self-organizing maps)
Clustering and dimensionality reduction
Pattern Classification, Chapter 10
2
Introduction
Previously, all our training samples were labeled;
learning from such samples is called supervised

Why are we interested in unsupervised procedures
which use unlabeled samples?

1) Collecting and Labeling a large set of sample patterns can
be costly

2) We can train with large amounts of (less expensive)
unlabeled data
We can then use supervision to label the groupings found. This is
appropriate for large data mining applications where the
contents of a large database are not known beforehand
3) Patterns may change slowly with time
Improved performance can be achieved if classifiers
running in an unsupervised mode are used

4) We can use unsupervised methods to identify
features that will then be useful for
categorization
smart feature extraction

5) We gain some insight into the nature (or
structure) of the data
which set of classification labels?
Mixture Densities & Identifiability

Assume:
functional forms for underlying probability densities are known
value of an unknown parameter vector must be learned
i.e., like chapter 3 but without class labels

Specific assumptions:
The samples come from a known number c of classes
The prior probabilities $P(\omega_j)$ for each class are known ($j = 1, \dots, c$)
The forms of the class-conditional densities $P(x \mid \omega_j, \theta_j)$ ($j = 1, \dots, c$) are known
The values of the c parameter vectors $\theta_1, \theta_2, \dots, \theta_c$ are unknown
The category labels are unknown
The PDF for the samples is:

$$P(x \mid \theta) = \sum_{j=1}^{c} \underbrace{P(x \mid \omega_j, \theta_j)}_{\text{component densities}} \, \underbrace{P(\omega_j)}_{\text{mixing parameters}}, \quad \text{where } \theta = (\theta_1, \theta_2, \dots, \theta_c)^t$$

This density function is called a mixture density

Our goal will be to use samples drawn from this
mixture density to estimate the unknown
parameter vector $\theta$.
Once $\theta$ is known, we can decompose the mixture
into its components and use a MAP classifier on
the derived densities.

Can $\theta$ be recovered from the mixture?
Consider the case where:
Unlimited number of samples
Use a nonparametric technique to find $p(x \mid \theta)$ for every x
If several $\theta$ result in the same $p(x \mid \theta)$, we can't find a unique
solution

This is the issue of solution identifiability.

Definition: Identifiability
A density $P(x \mid \theta)$ is said to be identifiable if
$\theta \neq \theta'$ implies that there exists an x such that:
$P(x \mid \theta) \neq P(x \mid \theta')$
As a simple example, consider the case where x is binary
and $P(x \mid \theta)$ is the mixture:

$$P(x \mid \theta) = \tfrac{1}{2}\theta_1^{x}(1-\theta_1)^{1-x} + \tfrac{1}{2}\theta_2^{x}(1-\theta_2)^{1-x} = \begin{cases} \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 1 \\ 1 - \tfrac{1}{2}(\theta_1 + \theta_2) & \text{if } x = 0 \end{cases}$$

Assume that:
$P(x = 1 \mid \theta) = 0.6$ and $P(x = 0 \mid \theta) = 0.4$
We know $P(x \mid \theta)$ but not $\theta$
We can say: $\theta_1 + \theta_2 = 1.2$ but not what $\theta_1$ and $\theta_2$ are.

Thus, we have a case in which the mixture distribution is
completely unidentifiable, and therefore unsupervised
learning is impossible.
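The binary example can be checked numerically. The sketch below is only an illustration: the function name `mixture_pmf` and the three candidate parameter pairs are assumptions chosen so that $\theta_1 + \theta_2 = 1.2$; they are not from the slides.

```python
def mixture_pmf(x, theta1, theta2):
    """P(x | theta) for the equal-weight mixture of two Bernoulli components."""
    comp1 = theta1 ** x * (1 - theta1) ** (1 - x)
    comp2 = theta2 ** x * (1 - theta2) ** (1 - x)
    return 0.5 * comp1 + 0.5 * comp2

# Three different parameter vectors, all with theta1 + theta2 = 1.2 (hypothetical choices)
candidates = [(0.6, 0.6), (0.2, 1.0), (0.9, 0.3)]
for t1, t2 in candidates:
    # Every candidate yields the same observable distribution over x
    print((t1, t2), mixture_pmf(1, t1, t2), mixture_pmf(0, t1, t2))
```

All candidates produce the same probabilities for x = 1 and x = 0, so no amount of data can tell them apart.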

With discrete distributions, too many components can
be problematic
Too many unknowns
Perhaps more unknowns than independent equations
Identifiability can become a serious problem!
While it can be shown that mixtures of normal densities are
usually identifiable, the parameters in the simple mixture
density

$$P(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}(x - \theta_1)^2\right] + \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}(x - \theta_2)^2\right]$$

cannot be uniquely identified if $P(\omega_1) = P(\omega_2)$
(we cannot recover a unique $\theta$ even from an infinite amount of
data!)
$\theta = (\theta_1, \theta_2)$ and $\theta = (\theta_2, \theta_1)$ are two possible vectors that can be
interchanged without affecting $P(x \mid \theta)$.
Although identifiability can be a problem, we will always assume that the
densities we are dealing with are identifiable!
ML Estimates
Suppose that we have a set $D = \{x_1, \dots, x_n\}$ of n
unlabeled samples drawn independently from
the mixture density:

$$p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)$$

($\theta$ is fixed but unknown!)

The MLE is:

$$\hat{\theta} = \arg\max_{\theta}\, p(D \mid \theta) \quad \text{with} \quad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$
ML Estimates

Then the log-likelihood is:

$$l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$$

And the gradient of the log-likelihood is:

$$\nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)$$
Since the gradient must vanish at the value of $\theta_i$
that maximizes $l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$,
the ML estimate $\hat{\theta}_i$ must satisfy the conditions

$$\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0 \quad (i = 1, \dots, c) \qquad (a)$$

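The ML estimates $\hat{P}(\omega_i)$ and $\hat{\theta}_i$ must satisfy:

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

and

$$\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0$$

where:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}$$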

Applications to Normal Mixtures
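$p(x \mid \omega_i, \theta_i) \sim N(\mu_i, \Sigma_i)$

Case 1 = Simplest case
Case 2 = more realistic case

| Case | $\mu_i$ | $\Sigma_i$ | $P(\omega_i)$ | c |
|------|---------|------------|---------------|---|
| 1 | ? | x | x | x |
| 2 | ? | ? | ? | x |
| 3 | ? | ? | ? | ? |

(? = unknown, x = known)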
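Case 1: Multivariate Normal, Unknown mean vectors

$\mu_i = \theta_i$, $i = 1, \dots, c$. The log-likelihood for the i-th mean is:

$$\ln p(x \mid \omega_i, \mu_i) = -\ln\left[(2\pi)^{d/2} |\Sigma_i|^{1/2}\right] - \tfrac{1}{2}(x - \mu_i)^t \Sigma_i^{-1} (x - \mu_i)$$

The ML estimate of $\mu = (\mu_i)$ is:

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})} \qquad (1)$$

where $P(\omega_i \mid x_k, \hat{\mu})$ is the fraction of those samples
having value $x_k$ that come from the i-th class, and $\hat{\mu}_i$ is the average
of the samples coming from the i-th class.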
Unfortunately, equation (1) does not give $\hat{\mu}_i$
explicitly

However, if we have some way of obtaining good
initial estimates $\hat{\mu}_i(0)$ for the unknown means,
equation (1) can be seen as an iterative process
for improving the estimates:

$$\hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))\, x_k}{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))}$$
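The iteration can be sketched in a few lines for a one-dimensional mixture. This is a minimal illustration, not the textbook's code: it assumes unit-variance components with known priors, and the names `gauss`, `update_means`, and `log_likelihood`, the sample size, and the starting values are all hypothetical.

```python
import math
import random

def gauss(x, mu):
    """Unit-variance normal density N(mu, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def update_means(xs, mus, priors):
    """One pass of the iteration: posterior-weighted averages of the samples."""
    num = [0.0] * len(mus)
    den = [0.0] * len(mus)
    for x in xs:
        joint = [p * gauss(x, m) for p, m in zip(priors, mus)]
        total = sum(joint)
        for i, j in enumerate(joint):
            w = j / total            # P(omega_i | x_k, current means)
            num[i] += w * x
            den[i] += w
    return [n / d for n, d in zip(num, den)]

def log_likelihood(xs, mus, priors):
    """Log-likelihood of the samples under the mixture with the given means."""
    return sum(math.log(sum(p * gauss(x, m) for p, m in zip(priors, mus))) for x in xs)

random.seed(0)
priors = [1 / 3, 2 / 3]              # assumed known
xs = [random.gauss(-2 if random.random() < priors[0] else 2, 1) for _ in range(200)]

mus = [-0.5, 0.5]                    # rough initial estimates
for _ in range(50):
    mus = update_means(xs, mus, priors)
print(mus)
```

Each pass should not decrease the log-likelihood, so printing `log_likelihood(xs, mus, priors)` inside the loop is a simple convergence check.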

This is a gradient ascent for maximizing the log-likelihood function

Example:
Consider the simple two-component one-dimensional
normal mixture

$$p(x \mid \mu_1, \mu_2) = \underbrace{\frac{1}{3\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}(x - \mu_1)^2\right]}_{\omega_1} + \underbrace{\frac{2}{3\sqrt{2\pi}} \exp\left[-\tfrac{1}{2}(x - \mu_2)^2\right]}_{\omega_2}$$

(2 clusters!)
Let's set $\mu_1 = -2$, $\mu_2 = 2$ and draw 25 samples
sequentially from this mixture. The log-likelihood
function is:

$$l(\mu_1, \mu_2) = \sum_{k=1}^{n} \ln p(x_k \mid \mu_1, \mu_2)$$
The maximum value of l occurs at:

$$\hat{\mu}_1 = -2.130 \quad \text{and} \quad \hat{\mu}_2 = 1.668$$

(which are not far from the true values: $\mu_1 = -2$ and $\mu_2 = +2$)

There is another peak at

$$\hat{\mu}_1 = 2.085 \quad \text{and} \quad \hat{\mu}_2 = -1.257$$

which has almost the same height, as can be seen
from the following figure.
This mixture of normal densities is identifiable
When the mixture density is not identifiable, the ML
solution is not unique
Case 2: All parameters unknown

No constraints are placed on the covariance
matrix

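Let $p(x \mid \mu, \sigma^2)$ be the two-component normal
mixture:

$$p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right] + \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x^2\right]$$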
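Suppose $\mu = x_1$, therefore:

$$p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} + \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_1^2\right]$$

For the rest of the samples:

$$p(x_k \mid \mu, \sigma^2) \ge \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{1}{2}x_k^2\right]$$

Finally,

$$p(x_1, \dots, x_n \mid \mu, \sigma^2) \ge \left\{\frac{1}{\sigma} + \exp\left[-\frac{1}{2}x_1^2\right]\right\} \frac{1}{(2\sqrt{2\pi})^n} \exp\left[-\frac{1}{2}\sum_{k=2}^{n} x_k^2\right]$$

where the term $1/\sigma \to \infty$ as $\sigma \to 0$.

The likelihood can therefore be made arbitrarily large and the maximum-likelihood
solution becomes singular.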
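The singular behaviour is easy to observe numerically. The following sketch evaluates the log-likelihood of the two-component mixture above with the narrow component centred on the first sample; the five sample values and the function name `log_likelihood` are arbitrary assumptions for illustration.

```python
import math

def log_likelihood(xs, mu, sigma):
    """Log-likelihood of the mixture 0.5*N(mu, sigma^2) + 0.5*N(0, 1)."""
    c = 1.0 / (2.0 * math.sqrt(2.0 * math.pi))
    total = 0.0
    for x in xs:
        narrow = (c / sigma) * math.exp(-0.5 * ((x - mu) / sigma) ** 2)
        broad = c * math.exp(-0.5 * x * x)
        total += math.log(narrow + broad)
    return total

xs = [0.3, -1.2, 0.8, 1.5, -0.4]    # arbitrary illustrative samples
mu = xs[0]                           # centre the narrow component on x_1
for sigma in (1.0, 1e-1, 1e-2, 1e-4, 1e-8):
    # For small sigma the log-likelihood grows roughly like -ln(sigma)
    print(sigma, log_likelihood(xs, mu, sigma))
```

The values keep growing as sigma shrinks, which is why only finite local maxima are meaningful as estimates.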
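Assumption: MLE is well-behaved at local maxima.
Consider the largest of the finite local maxima of
the likelihood function and use the ML estimation.
We obtain the following local-maximum-likelihood
estimates (an iterative scheme):

$$\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})$$

$$\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$

$$\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, (x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}$$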

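Where:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) = \frac{|\hat{\Sigma}_i|^{-1/2} \exp\left[-\tfrac{1}{2}(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k - \hat{\mu}_i)\right] \hat{P}(\omega_i)}{\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\left[-\tfrac{1}{2}(x_k - \hat{\mu}_j)^t \hat{\Sigma}_j^{-1} (x_k - \hat{\mu}_j)\right] \hat{P}(\omega_j)}$$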
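For intuition, the iterative scheme can be written out for the one-dimensional case, where each covariance matrix reduces to a scalar variance. This is a sketch under that assumption; the function name `em_step`, the synthetic data, and the starting values are hypothetical.

```python
import math
import random

def em_step(xs, priors, mus, variances):
    """One pass of the iterative scheme for a 1-D normal mixture."""
    c, n = len(priors), len(xs)
    post = []
    for x in xs:
        # Unnormalized posteriors; the common (2*pi)^(-1/2) factor cancels
        w = [priors[i] / math.sqrt(variances[i])
             * math.exp(-0.5 * (x - mus[i]) ** 2 / variances[i]) for i in range(c)]
        s = sum(w)
        post.append([wi / s for wi in w])
    sums = [sum(p[i] for p in post) for i in range(c)]
    new_priors = [sums[i] / n for i in range(c)]
    new_mus = [sum(p[i] * x for p, x in zip(post, xs)) / sums[i] for i in range(c)]
    new_vars = [sum(p[i] * (x - new_mus[i]) ** 2 for p, x in zip(post, xs)) / sums[i]
                for i in range(c)]
    return new_priors, new_mus, new_vars

random.seed(1)
xs = [random.gauss(-2, 1) for _ in range(100)] + [random.gauss(3, 0.5) for _ in range(200)]
priors, mus, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(100):
    priors, mus, variances = em_step(xs, priors, mus, variances)
print(priors, mus, variances)
```

The sketch has no guard against a variance collapsing toward zero, which is the singular solution discussed earlier; a practical version would need one.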
K-Means Clustering

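Goal: find the c mean vectors $\mu_1, \mu_2, \dots, \mu_c$

Replace the squared Mahalanobis distance $(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k - \hat{\mu}_i)$ by the squared Euclidean distance $\|x_k - \hat{\mu}_i\|^2$

Find the mean $\hat{\mu}_m$ nearest to $x_k$ and approximate $\hat{P}(\omega_i \mid x_k, \hat{\theta})$ as:

$$\hat{P}(\omega_i \mid x_k, \hat{\theta}) \approx \begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}$$

Use the iterative scheme to find $\hat{\mu}_1, \hat{\mu}_2, \dots, \hat{\mu}_c$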

If n is the known number of patterns and c the desired
number of clusters, the k-means algorithm is:

Begin
    initialize n, c, μ_1, μ_2, …, μ_c (randomly selected)
    do classify n samples according to nearest μ_i
        recompute μ_i
    until no change in μ_i
    return μ_1, μ_2, …, μ_c
End

Complexity is O(ndcT), where d is the number of features and T the number of iterations
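A direct Python rendering of the pseudocode might look as follows. It is a sketch only: the function name `kmeans`, the `max_iter` safeguard, the handling of empty clusters, and the toy data are assumptions rather than part of the slides.

```python
import random

def kmeans(samples, initial_means, max_iter=100):
    """Basic k-means on a list of equal-length tuples; returns (means, labels)."""
    c = len(initial_means)
    means = [list(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        # Classify each sample according to the nearest mean (squared Euclidean distance)
        new_labels = [
            min(range(c), key=lambda i: sum((a - b) ** 2 for a, b in zip(x, means[i])))
            for x in samples
        ]
        if new_labels == labels:     # no change in assignments, so no change in means
            break
        labels = new_labels
        # Recompute each mean from the samples assigned to it
        for i in range(c):
            members = [x for x, lab in zip(samples, labels) if lab == i]
            if members:              # keep the old mean if a cluster is empty
                means[i] = [sum(col) / len(members) for col in zip(*members)]
    return means, labels

random.seed(0)
data = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
means, labels = kmeans(data, random.sample(data, 2))   # randomly selected initial means
print(means, labels)
```

Because the result depends on the initial means, the call is usually repeated with several random starts and the partition with the smallest squared error is kept.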
K-means cluster on data from previous figure
Unsupervised Bayesian Learning
Other than the ML estimate, the Bayesian estimation
technique can also be used in the unsupervised case
(see chapters ML & Bayesian methods, Chap. 3 of the
textbook)
number of classes is known
class priors are known
forms of class-conditional probability densities $P(x \mid \omega_j, \theta_j)$ are
known
However, the full parameter vector $\theta$ is unknown
Part of our knowledge about $\theta$ is contained in the prior $p(\theta)$
The rest of our knowledge of $\theta$ is in the training samples
We compute the posterior distribution using the training
samples
We can compute $p(\theta \mid D)$ as seen previously

$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)} = \frac{p(x \mid \omega_i, D)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j)}$$

($P(\omega_i \mid D) = P(\omega_i)$ since selection of $\omega_i$ is independent of previous samples)

and passing through the usual formulation introducing the
unknown parameter vector $\theta$:

$$p(x \mid \omega_i, D) = \int p(x, \theta \mid \omega_i, D)\, d\theta = \int p(x \mid \theta, \omega_i, D)\, p(\theta \mid \omega_i, D)\, d\theta = \int p(x \mid \omega_i, \theta_i)\, p(\theta \mid D)\, d\theta$$

Hence, the best estimate of $p(x \mid \omega_i)$ is obtained by
averaging $p(x \mid \omega_i, \theta_i)$ over $\theta_i$.
The goodness of this estimate depends on $p(\theta \mid D)$; this is
the main issue of the problem.
From Bayes we get:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$

where independence of the samples yields the likelihood

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

or alternately (denoting $D^n$ the set of n samples) the recursive
form:

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}$$

If $p(\theta)$ is almost uniform in the region where $p(D \mid \theta)$ peaks,
then $p(\theta \mid D)$ peaks in the same place.
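The recursive form can be illustrated by tabulating the posterior on a grid. The sketch below assumes a one-dimensional mixture in which only the first mean is unknown, a uniform prior on [-4, 4], and a Riemann-sum approximation of the integral; the names `mixture`, `grid`, and `step` and all numeric choices are hypothetical.

```python
import math
import random

def mixture(x, theta):
    """p(x | theta) = (1/3) N(theta, 1) + (2/3) N(2, 1); only the first mean is unknown."""
    g = lambda m: math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)
    return g(theta) / 3 + 2 * g(2.0) / 3

step = 0.05
grid = [-4 + step * i for i in range(161)]            # candidate theta values in [-4, 4]
posterior = [1.0 / (len(grid) * step)] * len(grid)    # uniform prior density p(theta)

random.seed(0)
samples = [random.gauss(-2, 1) if random.random() < 1 / 3 else random.gauss(2, 1)
           for _ in range(100)]

for x in samples:
    # Numerator of the recursive form: p(x_n | theta) * p(theta | D^(n-1))
    unnorm = [mixture(x, t) * p for t, p in zip(grid, posterior)]
    z = sum(unnorm) * step                            # approximates the integral in the denominator
    posterior = [u / z for u in unnorm]

theta_map = grid[posterior.index(max(posterior))]
print(theta_map)
```

With enough samples the tabulated posterior concentrates near the true mean of the first component, which is the situation in which the ML and Bayesian answers agree.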
If the only significant peak occurs at $\theta = \hat{\theta}$ and the peak is
very sharp, then

$$p(x \mid \omega_i, D) \approx p(x \mid \omega_i, \hat{\theta}_i)$$

and

$$P(\omega_i \mid x, D) \approx \frac{p(x \mid \omega_i, \hat{\theta}_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, \hat{\theta}_j)\, P(\omega_j)}$$

Therefore, the ML estimate is justified.
Both approaches coincide if large amounts of data are
available.
In small sample size problems they can agree or not,
depending on the form of the distributions
The ML method is typically easier
to implement than the Bayesian one

Formal Bayesian solution: unsupervised learning of the
parameters of a mixture density is similar to the supervised
learning of the parameters of a component density.
Significant differences: identifiability, computational
complexity
The issue of identifiability
With SL (supervised learning), the lack of identifiability means that we do not obtain a
unique vector, but an equivalence class, which does not present
theoretical difficulty as all yield the same component density.
With UL (unsupervised learning), the lack of identifiability means that the mixture cannot be
decomposed into its true components
$p(x \mid D^n)$ may still converge to $p(x)$, but $p(x \mid \omega_i, D^n)$ will not in
general converge to $p(x \mid \omega_i)$, hence there is a theoretical barrier.
The computational complexity
With SL, sufficient statistics allow the solutions to be
computationally feasible
With UL, samples come from a mixture density and
there is little hope of finding simple exact solutions for
$p(D \mid \theta)$. n samples result in $2^n$ terms
(corresponding to the ways in which the n samples
can be drawn from the 2 classes).
Another way of comparing UL and SL is to
consider the usual equation in which the mixture
density is explicit

$$p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta} = \frac{\left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\right] p(\theta \mid D^{n-1})}{\int \left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\right] p(\theta \mid D^{n-1})\, d\theta}$$

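If we consider the case in which $P(\omega_1) = 1$ and all
other prior probabilities are zero, corresponding to
the supervised case in which all samples come
from the class $\omega_1$, then we get

$$p(\theta \mid D^n) = \frac{p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})\, d\theta}$$

From the previous slide:

$$p(\theta \mid D^n) = \frac{\left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\right] p(\theta \mid D^{n-1})}{\int \left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\right] p(\theta \mid D^{n-1})\, d\theta}$$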

Comparing the two equations, we see that observing an additional sample
changes the estimate of $\theta$.
Ignoring the denominator, which is independent of $\theta$, the only significant
difference is that
in SL, we multiply the prior density for $\theta$ by the component density $p(x_n \mid \omega_1, \theta_1)$
in UL, we multiply the prior density by the whole mixture $\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)$

Assuming that the sample did come from class $\omega_1$, the effect of not knowing this
category is to diminish the influence of $x_n$ in changing $\theta$ for category 1.

Equations from the previous slide:

$$p(\theta \mid D^n) = \frac{p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})\, d\theta}$$

$$p(\theta \mid D^n) = \frac{\left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\right] p(\theta \mid D^{n-1})}{\int \left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j)\, P(\omega_j)\right] p(\theta \mid D^{n-1})\, d\theta}$$
Data Clustering
Structures of multidimensional patterns are important
for clustering
If we know that data come from a specific distribution,
such data can be represented by a compact set of
parameters (sufficient statistics)
If samples are assumed to come from a specific
distribution but actually do not, these statistics are a
misleading representation of the data

Approximation of density functions:
Mixtures of normal distributions can approximate arbitrary
PDFs

In these cases, one can use parametric methods
to estimate the parameters of the mixture density.
No free lunch: dimensionality issue!
Huh?

Caveat
If little prior knowledge can be assumed, the
assumption of a parametric form is meaningless:
Issue: imposing structure vs finding structure

Use a nonparametric method to estimate the
unknown mixture density.

Alternatively, for subclass discovery:
use a clustering procedure
identify data points having strong internal similarities
Similarity measures
What do we mean by similarity?
Two issues:
How to measure the similarity between samples?
How to evaluate a partitioning of a set into clusters?

An obvious measure of similarity/dissimilarity is the
distance between samples

Samples of the same cluster should be closer to
each other than to samples in different clusters.
Euclidean distance is a possible metric:
assume samples belong to the same cluster if their
distance is less than a threshold $d_0$

Clusters defined by Euclidean distance are
invariant to translations and rotation of the feature
space, but not invariant to general transformations
that distort the distance relationship
Achieving invariance:
normalize the data, e.g., such that they all have zero
means and unit variance,
or use principal components for invariance to rotation
A broad class of metrics is the Minkowski metric

$$d(x, x') = \left(\sum_{k=1}^{d} |x_k - x'_k|^q\right)^{1/q}$$

where $q \ge 1$ is a selectable parameter:
q = 1: Manhattan or city block metric
q = 2: Euclidean metric
One can also use a nonmetric similarity function
$s(x, x')$ to compare 2 vectors.
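The Minkowski metric translates directly into code. The function name `minkowski` and the two test points are illustrative assumptions.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q >= 1 between two equal-length sequences."""
    if q < 1:
        raise ValueError("q must be at least 1 for a metric")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))   # city block distance: 7.0
print(minkowski(x, y, 2))   # Euclidean distance: 5.0
```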
It is typically a symmetric function whose value is
large when x and x' are similar.
For example, the normalized inner product

$$s(x, x') = \frac{x^t x'}{\|x\|\, \|x'\|}$$

In case of binary-valued features, we have, e.g.:

$$s(x, x') = \frac{x^t x'}{d}$$

$$s(x, x') = \frac{x^t x'}{x^t x + x'^t x' - x^t x'} \quad \text{(Tanimoto distance)}$$
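The three similarity functions can be compared on a pair of binary feature vectors. The function names and the two example vectors below are assumptions made for illustration.

```python
import math

def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Normalized inner product: the cosine of the angle between x and y."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def shared_fraction(x, y):
    """Fraction of the d binary attributes possessed by both x and y."""
    return dot(x, y) / len(x)

def tanimoto(x, y):
    """Shared attributes divided by attributes possessed by x or by y."""
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

a = (1, 1, 0, 1, 0)
b = (1, 0, 0, 1, 1)
print(cosine_similarity(a, b), shared_fraction(a, b), tanimoto(a, b))
```

Here the vectors share two attributes, each has three, and four attributes appear in at least one of them, so the Tanimoto value is 2/4.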
Clustering as optimization
The second issue: how to evaluate a partitioning of
a set into clusters?

Clustering can be posed as an optimization of a
criterion function
The sum-of-squared-error criterion and its variants
Scatter criteria
The sum-of-squared-error criterion
Let $n_i$ be the number of samples in $D_i$, and $m_i$ the mean of
those samples

$$m_i = \frac{1}{n_i} \sum_{x \in D_i} x$$
The sum of squared error is defined as

$$J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2$$

This criterion defines clusters by their mean vectors $m_i$:
it minimizes the sum of the squared lengths of the error $x - m_i$.
The minimum variance partition minimizes $J_e$

Results:
Good when clusters form well separated compact clouds
Bad with large differences in the number of samples in different
clusters.
Scatter criteria
Scatter matrices used in multiple discriminant analysis,
i.e., the within-scatter matrix $S_W$ and the between-scatter matrix $S_B$:
$S_T = S_B + S_W$

Note:
$S_T$ does not depend on partitioning
In contrast, $S_B$ and $S_W$ depend on partitioning
Two approaches:
minimize the within-cluster scatter
maximize the between-cluster scatter

The trace (sum of diagonal elements) is the
simplest scalar measure of the scatter matrix

$$\operatorname{tr}[S_W] = \sum_{i=1}^{c} \operatorname{tr}[S_i] = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2 = J_e$$

proportional to the sum of the variances in the
coordinate directions
This is the sum-of-squared-error criterion, $J_e$.

As $\operatorname{tr}[S_T] = \operatorname{tr}[S_W] + \operatorname{tr}[S_B]$ and $\operatorname{tr}[S_T]$ is independent of
the partitioning, no new results can be derived by
maximizing $\operatorname{tr}[S_B]$

Seeking to minimize the within-cluster criterion
$J_e = \operatorname{tr}[S_W]$ is equivalent to maximizing the between-cluster
criterion

$$\operatorname{tr}[S_B] = \sum_{i=1}^{c} n_i \|m_i - m\|^2$$

where m is the total mean vector:

$$m = \frac{1}{n} \sum_{x \in D} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i$$
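The trace identity can be verified numerically for any partition. The sketch below computes the three traces directly from their definitions; the helper names, the six points, and the two partitions labelled `good` and `bad` are assumptions chosen for illustration.

```python
def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def trace_terms(clusters):
    """Return (tr[S_W], tr[S_B], tr[S_T]) for a list of clusters of points."""
    everything = [x for cluster in clusters for x in cluster]
    m = mean(everything)                                             # total mean vector
    tr_sw = sum(sq_dist(x, mean(c)) for c in clusters for x in c)   # equals J_e
    tr_sb = sum(len(c) * sq_dist(mean(c), m) for c in clusters)
    tr_st = sum(sq_dist(x, m) for x in everything)
    return tr_sw, tr_sb, tr_st

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
good = [points[:3], points[3:]]
bad = [points[:1] + points[3:4], points[1:3] + points[4:]]
for partition in (good, bad):
    print(trace_terms(partition))
```

Both partitions give the same total trace, so the partition with the smaller within-cluster trace necessarily has the larger between-cluster trace.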
Iterative optimization
Clustering is a discrete optimization problem

A finite data set has a finite number of partitions

What is the cost of exhaustive search?
Approximately $c^n/c!$ partitions for c clusters. Not a good idea

Typically iterative optimization is used:
Start from a reasonable initial partition
Redistribute samples to minimize the criterion function
This guarantees local, not global, optimization.
Consider an iterative procedure to minimize the
sum-of-squared-error criterion $J_e$

$$J_e = \sum_{i=1}^{c} J_i \quad \text{where} \quad J_i = \sum_{x \in D_i} \|x - m_i\|^2$$

where $J_i$ is the effective error per cluster.

Moving a sample $\hat{x}$ from cluster $D_i$ to $D_j$ changes
the errors in the 2 clusters by:

$$J_j^* = J_j + \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2 \qquad J_i^* = J_i - \frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2$$
Hence, the transfer is advantageous if the
decrease in $J_i$ is larger than the increase in $J_j$

$$\frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2 > \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2$$
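The transfer test can be checked against a brute-force recomputation of the squared error. In this sketch the helper names, the two small clusters, and the choice of the moved sample are hypothetical.

```python
def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def sse(cluster):
    """J_i: sum of squared distances from the cluster mean."""
    m = mean(cluster)
    return sum(sq_dist(x, m) for x in cluster)

def transfer_gain(x_hat, d_i, d_j):
    """Decrease in J_i minus increase in J_j when x_hat moves from d_i to d_j."""
    n_i, n_j = len(d_i), len(d_j)
    decrease = n_i / (n_i - 1) * sq_dist(x_hat, mean(d_i))   # requires n_i > 1
    increase = n_j / (n_j + 1) * sq_dist(x_hat, mean(d_j))
    return decrease - increase

d_i = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0)]   # (4, 4) looks misplaced in this cluster
d_j = [(5.0, 4.0), (4.0, 5.0)]
x_hat = d_i[2]
gain = transfer_gain(x_hat, d_i, d_j)
before = sse(d_i) + sse(d_j)
after = sse(d_i[:2]) + sse(d_j + [x_hat])
print(gain, before - after)
```

The two printed numbers agree, which shows that the incremental formulas avoid recomputing the whole criterion after every candidate move.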

Alg. 3 is a sequential version of the k-means algorithm
Alg. 3 updates each time a sample is reclassified
k-means waits until all n samples have been reclassified
before updating

Alg. 3 can get trapped in local minima
It depends on the order of the samples
Basically, a myopic approach
But it is online!
Starting point is always a problem
Approaches:
1. Random centers of clusters
2. Repetition with different random initialization
3. Take as the c-cluster starting point the solution of the (c-1)-cluster
problem plus the sample farthest from the
nearest cluster center
Hierarchical Clustering
Many times, clusters are not disjoint, but a
cluster may have subclusters, in turn having sub-subclusters, etc.
Consider a sequence of partitions of the n
samples into c clusters
The first is a partition into n clusters, each one
containing exactly one sample
The second is a partition into n-1 clusters, the third
into n-2, and so on, until the n-th, in which there is only
one cluster containing all of the samples
At level k in the sequence, c = n-k+1.
Given any two samples x and x', they will be grouped
together at some level, and if they are grouped at level k,
they remain grouped for all higher levels
Hierarchical clustering has a tree representation called a
dendrogram

Are groupings natural or forced? Check the similarity values
Evenly distributed similarity values give no justification for grouping

Another representation is based on sets, e.g., Venn
diagrams
Hierarchical clustering can be divided into
agglomerative and divisive.

Agglomerative (bottom-up): start with n singleton clusters and form the sequence by
merging clusters

Divisive (top-down): start with all of the samples in one cluster and form the sequence by
successively splitting clusters

Agglomerative hierarchical clustering

The procedure terminates when the specified
number of clusters has been obtained, and returns
the clusters as sets of points, rather than the mean
or a representative vector for each cluster
At any level, the distance between nearest clusters
can provide the dissimilarity value for that level
To find the nearest clusters, one can use

$$d_{min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \|x - x'\|$$

$$d_{max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \|x - x'\|$$

$$d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \|x - x'\|$$

$$d_{mean}(D_i, D_j) = \|m_i - m_j\|$$

which behave quite similarly if the clusters are
hyperspherical and well separated.
The computational complexity is $O(cn^2d^2)$, $n \gg c$
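A compact sketch of agglomerative clustering with interchangeable cluster distances is given below. It is a naive implementation that recomputes all pairwise cluster distances at every merge; the function names and the toy data are assumptions, and $d_{mean}$ is omitted for brevity.

```python
import math

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def d_min(a, b):
    return min(dist(x, y) for x in a for y in b)

def d_max(a, b):
    return max(dist(x, y) for x in a for y in b)

def d_avg(a, b):
    return sum(dist(x, y) for x in a for y in b) / (len(a) * len(b))

def agglomerate(samples, c, cluster_distance):
    """Merge the two nearest clusters until only c clusters remain."""
    clusters = [[x] for x in samples]            # start from n singleton clusters
    while len(clusters) > c:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (6.0, 6.0), (6.0, 7.0), (7.0, 6.0)]
for measure in (d_min, d_max, d_avg):
    print(measure.__name__, agglomerate(data, 2, measure))
```

On compact, well separated data such as this, all three measures return the same two clusters; they differ mainly on elongated or noisy data.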
The nearest neighbor algorithm (single linkage)
$d_{min}$ is used

Viewed in graph terms, an edge is added between the
nearest nonconnected components

Equivalent to Prim's minimum spanning tree
algorithm

Terminates when the distance between nearest
clusters exceeds an arbitrary threshold
The use of $d_{min}$ as a distance measure with
agglomerative clustering generates a minimal
spanning tree

Chaining effect: a defect of this distance measure
(right)

The farthest neighbor algorithm (complete linkage)
$d_{max}$ is used

This method discourages the growth of elongated
clusters

In graph theoretic terms:
every cluster is a complete subgraph
the distance between two clusters is determined by the
most distant nodes in the 2 clusters

terminates when the distance between nearest
clusters exceeds an arbitrary threshold

When two clusters are merged, the graph is
changed by adding edges between every pair of
nodes in the 2 clusters

All the procedures involving minima or maxima are
sensitive to outliers. The use of $d_{mean}$ or $d_{avg}$ is a
natural compromise
The problem of the number of clusters
How many clusters should there be?
For clustering by extremizing a criterion function
repeat the clustering with c=1, c=2, c=3, etc.
look for large changes in criterion function
Alternatively:
state a threshold for the creation of a new cluster
useful for on line cases
sensitive to order of presentation of data.
These approaches are similar to model selection
procedures
Graph-theoretic methods
Caveat: no uniform way of posing clustering as a
graph theoretic problem
Generalize from a threshold distance to arbitrary
similarity measures.
If $s_0$ is a threshold value, we can say that $x_i$ is
similar to $x_j$ if $s(x_i, x_j) > s_0$.
We can define a similarity matrix $S = [s_{ij}]$

$$s_{ij} = \begin{cases} 1 & \text{if } s(x_i, x_j) > s_0 \\ 0 & \text{otherwise} \end{cases}$$
This matrix induces a similarity graph, dual to S, in
which nodes correspond to points and an edge joins
nodes i and j iff $s_{ij} = 1$.
Single-linkage alg.: two samples x and x' are in
the same cluster if there exists a chain $x, x_1, x_2, \dots, x_k, x'$
such that x is similar to $x_1$, $x_1$ to $x_2$, and
so on. The clusters are the connected components of the graph
Complete-link alg.: all samples in a given cluster
must be similar to one another and no sample can
be in more than one cluster.
The nearest-neighbor algorithm is a method to find the
minimum spanning tree and vice versa
Removal of the longest edge produces a 2-cluster
grouping, removal of the next longest edge produces a
3-cluster grouping, and so on.
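The single-linkage view can be sketched by thresholding a similarity function and extracting connected components. The similarity function `sim` (exponential decay with Euclidean distance), the threshold, and the five points are assumptions chosen for illustration; any symmetric similarity could be substituted.

```python
import math

def similarity_matrix(samples, similarity, s0):
    """s_ij = 1 if similarity(x_i, x_j) > s0, else 0."""
    return [[1 if similarity(x, y) > s0 else 0 for y in samples] for x in samples]

def connected_components(s):
    """Single-linkage clusters: connected components of the similarity graph."""
    n = len(s)
    label = [None] * n
    current = 0
    for start in range(n):
        if label[start] is not None:
            continue
        stack = [start]
        label[start] = current
        while stack:                     # depth-first search along edges with s_ij = 1
            i = stack.pop()
            for j in range(n):
                if s[i][j] == 1 and label[j] is None:
                    label[j] = current
                    stack.append(j)
        current += 1
    return label

def sim(x, y):
    # Illustrative similarity (an assumption): decays with Euclidean distance
    return math.exp(-math.dist(x, y))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (8.0, 0.0), (9.0, 0.0)]
S = similarity_matrix(points, sim, s0=math.exp(-1.5))
print(connected_components(S))   # [0, 0, 0, 1, 1]
```

Points 0 and 2 are not directly similar, yet they share a cluster through the chain via point 1, which is exactly the chaining behaviour of single linkage.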
This is a divisive hierarchical procedure, and it
suggests ways of dividing the graph into subgraphs
E.g., in selecting an edge to remove, compare its
length with the lengths of the other edges incident on its
nodes

One useful statistic to be estimated from the
minimal spanning tree is the edge length
distribution
For instance, in the case of two dense clusters
immersed in a sparse set of points: