Classification
All materials in these slides were taken from
Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000,
with the permission of the authors and the publisher
Chapter 10
Unsupervised Learning & Clustering
Introduction
Mixture Densities and Identifiability
ML Estimates
Application to Normal Mixtures
k-means algorithm
Unsupervised Bayesian Learning
Data description and clustering
Criterion function for clustering
Hierarchical clustering
The number-of-clusters problem and cluster validation
Online clustering
Graph-theoretic methods
PCA and ICA
Low-dimensional representations and multidimensional scaling (self-organizing maps)
Clustering and dimensionality reduction
Pattern Classification, Chapter 10
Introduction
Previously, all our training samples were labeled: learning
from such samples is said to be supervised
Why are we interested in unsupervised procedures
which use unlabeled samples?
1) Collecting and Labeling a large set of sample patterns can
be costly
2) We can train with large amounts of (less expensive)
unlabeled data
Then use supervision to label the groupings found; this is
appropriate for large data-mining applications where the
contents of a large database are not known beforehand
3) Patterns may change slowly with time
Improved performance can be achieved if classifiers
running in an unsupervised mode are used
4) We can use unsupervised methods to identify
features that will then be useful for
categorization
smart feature extraction
5) We gain some insight into the nature (or
structure) of the data
which set of classification labels?
Mixture Densities & Identifiability
Assume:
functional forms for underlying probability densities are known
value of an unknown parameter vector must be learned
i.e., like chapter 3 but without class labels
Specific assumptions:
The samples come from a known number c of classes
The prior probabilities P(ω_j) for each class are known (j = 1, ..., c)
The forms of the class-conditional densities p(x | ω_j, θ_j) (j = 1, ..., c) are known
The values of the c parameter vectors θ_1, θ_2, ..., θ_c are unknown
The category labels are unknown
The PDF for the samples is:

p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j),
\qquad \theta = (\theta_1, \theta_2, \ldots, \theta_c)^t

where the p(x | ω_j, θ_j) are the component densities and the P(ω_j) are the mixing parameters.
This density function is called a mixture density
Our goal will be to use samples drawn from this mixture density to estimate the unknown parameter vector θ.
Once θ is known, we can decompose the mixture into its components and use a MAP classifier on the derived densities.
Can θ be recovered from the mixture?
Consider the case where:
We have an unlimited number of samples
We use a nonparametric technique to find p(x | θ) for every x
If several θ result in the same p(x | θ), we cannot find a unique solution
This is the issue of solution identifiability.
Definition: Identifiability
A density p(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that:

p(x \mid \theta) \neq p(x \mid \theta')
As a simple example, consider the case where x is binary and p(x | θ) is the mixture:

p(x \mid \theta) = \frac{1}{2}\,\theta_1^{x}(1-\theta_1)^{1-x} + \frac{1}{2}\,\theta_2^{x}(1-\theta_2)^{1-x}
= \begin{cases} \frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 1 \\[4pt] 1 - \frac{1}{2}(\theta_1+\theta_2) & \text{if } x = 0 \end{cases}

Assume that:
P(x = 1 | θ) = 0.6 and P(x = 0 | θ) = 0.4
We know p(x | θ) but not θ
We can say: θ_1 + θ_2 = 1.2, but not what θ_1 and θ_2 are.
Thus, we have a case in which the mixture distribution is completely unidentifiable, and therefore unsupervised learning is impossible.
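A quick numeric illustration (a sketch, not from the text): any two parameter vectors satisfying θ_1 + θ_2 = 1.2 induce exactly the same mixture distribution, so the data cannot distinguish them. The function name `mixture_p1` is an assumption chosen here.

```python
# P(x = 1 | theta) = (theta_1 + theta_2) / 2 for the two-component
# Bernoulli mixture with equal mixing weights.

def mixture_p1(theta1, theta2):
    """Probability that x = 1 under the binary mixture above."""
    return 0.5 * theta1 + 0.5 * theta2

# Both choices satisfy theta_1 + theta_2 = 1.2, hence P(x=1) = 0.6,
# and no amount of data can tell them apart:
p_a = mixture_p1(0.8, 0.4)
p_b = mixture_p1(0.5, 0.7)
```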
With discrete distributions, too many mixture components can be problematic:
Too many unknowns
Perhaps more unknowns than independent equations
Identifiability can become a serious problem!
While it can be shown that mixtures of normal densities are usually identifiable, the parameters in the simple mixture density

p(x \mid \theta) = \frac{P(\omega_1)}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x-\theta_1)^2\right]
+ \frac{P(\omega_2)}{\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x-\theta_2)^2\right]

cannot be uniquely identified if P(ω_1) = P(ω_2)
(we cannot recover a unique θ even from an infinite amount of data!)
θ = (θ_1, θ_2) and θ′ = (θ_2, θ_1) are two possible vectors that can be interchanged without affecting p(x | θ).
Identifiability can be a problem, but we will always assume that the densities we are dealing with are identifiable!
ML Estimates
Suppose that we have a set D = {x_1, ..., x_n} of n unlabeled samples drawn independently from the mixture density

p(x \mid \theta) = \sum_{j=1}^{c} p(x \mid \omega_j, \theta_j)\, P(\omega_j)

where θ is fixed but unknown. The maximum-likelihood estimate is

\hat{\theta} = \arg\max_{\theta} p(D \mid \theta),
\qquad p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)
ML Estimates
The log-likelihood is:

l = \sum_{k=1}^{n} \ln p(x_k \mid \theta)

And the gradient of the log-likelihood with respect to θ_i is:

\nabla_{\theta_i} l = \sum_{k=1}^{n} P(\omega_i \mid x_k, \theta)\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \theta_i)
Since the gradient must vanish at the value of θ_i that maximizes l, the ML estimate \hat{\theta}_i must satisfy the conditions

\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0
\qquad (i = 1, \ldots, c)
The ML estimates \hat{P}(\omega_i) and \hat{\theta}_i must satisfy:

\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

and

\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, \nabla_{\theta_i} \ln p(x_k \mid \omega_i, \hat{\theta}_i) = 0

where:

\hat{P}(\omega_i \mid x_k, \hat{\theta}) =
\frac{p(x_k \mid \omega_i, \hat{\theta}_i)\, \hat{P}(\omega_i)}
{\sum_{j=1}^{c} p(x_k \mid \omega_j, \hat{\theta}_j)\, \hat{P}(\omega_j)}

For normal mixtures with unknown mean vectors this yields

\hat{\mu}_i = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})\, x_k}
{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu})}
\qquad (1)

which can be solved by the iterative scheme

\hat{\mu}_i(j+1) = \frac{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))\, x_k}
{\sum_{k=1}^{n} P(\omega_i \mid x_k, \hat{\mu}(j))}

Example: the two-component normal mixture

p(x \mid \mu_1, \mu_2) = \frac{1}{3\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x-\mu_1)^2\right]
+ \frac{2}{3\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}(x-\mu_2)^2\right]

with log-likelihood

l(\mu_1, \mu_2) = \sum_{k=1}^{n} \ln p(x_k \mid \mu_1, \mu_2)
The maximum value of l occurs at:

\hat{\mu}_1 = -2.130 \quad \text{and} \quad \hat{\mu}_2 = 1.668

(which are not far from the true values: μ_1 = -2 and μ_2 = +2)
There is another peak at

\hat{\mu}_1 = 2.085 \quad \text{and} \quad \hat{\mu}_2 = -1.257

which has almost the same height, as can be seen from the following figure.
This mixture of normal densities is identifiable
When the mixture density is not identifiable, the ML solution is not unique
Case 2: All parameters unknown
No constraints are placed on the covariance matrix
Let p(x | μ, σ²) be the two-component normal mixture:

p(x \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]
+ \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}x^2\right]
Suppose μ = x_1; therefore:

p(x_1 \mid \mu, \sigma^2) = \frac{1}{2\sqrt{2\pi}\,\sigma}
+ \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}x_1^2\right]

For the rest of the samples:

p(x_k \mid \mu, \sigma^2) \ge \frac{1}{2\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}x_k^2\right]

Finally,

p(x_1, \ldots, x_n \mid \mu, \sigma^2) \ge
\left\{\frac{1}{\sigma} + \exp\!\left[-\frac{1}{2}x_1^2\right]\right\}
\frac{1}{(2\sqrt{2\pi})^{n}} \exp\!\left[-\frac{1}{2}\sum_{k=2}^{n} x_k^2\right]

This bound grows without limit as σ → 0: the likelihood is therefore arbitrarily large and the maximum-likelihood solution becomes singular.
Assumption: the MLE is well-behaved at local maxima.
Consider the largest of the finite local maxima of the likelihood function and use the ML estimation.
We obtain the following local-maximum-likelihood estimates (an iterative scheme):

\hat{P}(\omega_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})

\hat{\mu}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\, x_k}
{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}

\hat{\Sigma}_i = \frac{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})\,(x_k - \hat{\mu}_i)(x_k - \hat{\mu}_i)^t}
{\sum_{k=1}^{n} \hat{P}(\omega_i \mid x_k, \hat{\theta})}
where:

\hat{P}(\omega_i \mid x_k, \hat{\theta}) =
\frac{|\hat{\Sigma}_i|^{-1/2} \exp\!\left[-\frac{1}{2}(x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k - \hat{\mu}_i)\right] \hat{P}(\omega_i)}
{\sum_{j=1}^{c} |\hat{\Sigma}_j|^{-1/2} \exp\!\left[-\frac{1}{2}(x_k - \hat{\mu}_j)^t \hat{\Sigma}_j^{-1} (x_k - \hat{\mu}_j)\right] \hat{P}(\omega_j)}
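The iterative scheme above can be sketched in one dimension. The following is a minimal illustration (the helper name `em_step` and the synthetic data are assumptions made here, not part of the slides) that alternates between computing the posteriors P̂(ω_i | x_k, θ̂) and re-estimating priors, means, and variances:

```python
import math
import random

def em_step(xs, pri, mu, var):
    """One pass of the scheme: posteriors, then re-estimated parameters (1-D)."""
    c, n = len(mu), len(xs)
    post = []
    for x in xs:
        # hat P(omega_i | x_k, theta_hat): normal component density times prior
        comp = [pri[i] / math.sqrt(2 * math.pi * var[i])
                * math.exp(-0.5 * (x - mu[i]) ** 2 / var[i]) for i in range(c)]
        s = sum(comp)
        post.append([p / s for p in comp])
    # re-estimate priors, means, and variances from the weighted samples
    w = [sum(post[k][i] for k in range(n)) for i in range(c)]
    new_pri = [w[i] / n for i in range(c)]
    new_mu = [sum(post[k][i] * xs[k] for k in range(n)) / w[i] for i in range(c)]
    new_var = [sum(post[k][i] * (xs[k] - new_mu[i]) ** 2 for k in range(n)) / w[i]
               for i in range(c)]
    return new_pri, new_mu, new_var

# Synthetic data: 1/3 of the samples near -2, 2/3 near +2 (illustrative only)
random.seed(0)
xs = [random.gauss(-2, 1) for _ in range(200)] + [random.gauss(2, 1) for _ in range(400)]
pri, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(50):
    pri, mu, var = em_step(xs, pri, mu, var)
```

After the iterations, the means should sit near the true component centers and the priors near the 1/3 and 2/3 mixing weights.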
k-Means Clustering
Goal: find the c mean vectors μ_1, μ_2, ..., μ_c
Replace the squared Mahalanobis distance (x_k - \hat{\mu}_i)^t \hat{\Sigma}_i^{-1} (x_k - \hat{\mu}_i) by the squared Euclidean distance \|x_k - \hat{\mu}_i\|^2
Find the mean \hat{\mu}_m nearest to x_k and approximate \hat{P}(\omega_i \mid x_k, \hat{\theta}) as:

\hat{P}(\omega_i \mid x_k, \hat{\theta}) \approx
\begin{cases} 1 & \text{if } i = m \\ 0 & \text{otherwise} \end{cases}

Use the iterative scheme to find \hat{\mu}_1, \hat{\mu}_2, \ldots, \hat{\mu}_c
If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is:

Begin
  initialize n, c, μ_1, μ_2, ..., μ_c (randomly selected)
  do
    classify the n samples according to the nearest μ_i
    recompute the μ_i
  until no change in the μ_i
  return μ_1, μ_2, ..., μ_c
End

Complexity is O(ndcT), where d is the number of features and T the number of iterations
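The algorithm above can be sketched as follows; this is a minimal illustration on synthetic data (the helper name `k_means` and the two-cloud data set are assumptions, not from the slides):

```python
import random

def k_means(xs, c, iters=100):
    """Basic k-means on d-dimensional points given as tuples."""
    means = random.sample(xs, c)          # randomly selected initial means
    for _ in range(iters):
        # classify each sample according to the nearest mean
        clusters = [[] for _ in range(c)]
        for x in xs:
            i = min(range(c),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(x, means[i])))
            clusters[i].append(x)
        # recompute the means; stop when they no longer change
        new = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else means[i]
               for i, cl in enumerate(clusters)]
        if new == means:
            break
        means = new
    return means

# Two well-separated clouds around (0, 0) and (4, 4) (illustrative only)
random.seed(1)
pts = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(50)] + \
      [(random.gauss(4, 0.3), random.gauss(4, 0.3)) for _ in range(50)]
centers = sorted(k_means(pts, 2))
```

Each returned center should sit near one of the two cloud centroids.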
Figure: k-means clustering on the data from the previous figure
Unsupervised Bayesian Learning
Besides the ML estimate, the Bayesian estimation technique can also be used in the unsupervised case (see ML & Bayesian methods, Chap. 3 of the textbook)
The number of classes is known
The class priors are known
The forms of the class-conditional probability densities p(x | ω_j, θ_j) are known
However, the full parameter vector θ is unknown
Part of our knowledge about θ is contained in the prior p(θ);
the rest of our knowledge of θ is in the training samples
We compute the posterior distribution using the training samples
We can compute p(θ | D) as seen previously, passing through the usual formulation introducing the unknown parameter vector θ:

P(\omega_i \mid x, D) =
\frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j \mid D)} =
\frac{p(x \mid \omega_i, D)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D)\, P(\omega_j)}

where P(ω_i | D) = P(ω_i), since the selection of ω_i is independent of the previous samples. Moreover,

p(x \mid \omega_i, D) = \int p(x, \theta \mid \omega_i, D)\, d\theta
= \int p(x \mid \omega_i, \theta_i)\, p(\theta \mid D)\, d\theta

Hence, the best estimate of p(x | ω_i) is obtained by averaging p(x | ω_i, θ_i) over θ_i.
The goodness of this estimate depends on p(θ | D); this is the main issue of the problem.
From Bayes we get:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}

where independence of the samples yields the likelihood

p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)

or alternatively (denoting by D^n the set of n samples) the recursive form:

p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}

If p(θ) is almost uniform in the region where p(D | θ) peaks, then p(θ | D) peaks in the same place.
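The recursive form lends itself to a simple numeric sketch: discretize θ, start from a prior, and fold in one sample at a time via p(θ | D^n) ∝ p(x_n | θ) p(θ | D^{n-1}). The Bernoulli likelihood and the grid below are illustrative assumptions, not from the slides:

```python
# Recursive Bayesian update on a discretized parameter grid, assuming a
# simple Bernoulli likelihood p(x | theta) = theta^x (1 - theta)^(1 - x).
thetas = [i / 100 for i in range(1, 100)]   # grid over (0, 1)
post = [1.0 / len(thetas)] * len(thetas)    # uniform prior p(theta)

data = [1, 1, 0, 1, 1, 1, 0, 1]             # observed samples, one at a time
for x in data:
    like = [t if x == 1 else 1 - t for t in thetas]
    post = [l * p for l, p in zip(like, post)]
    s = sum(post)
    post = [p / s for p in post]            # normalize: p(theta | D^n)

# The posterior concentrates around the sample mean (6/8 = 0.75):
map_theta = thetas[max(range(len(post)), key=post.__getitem__)]
```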
If the only significant peak occurs at \hat{\theta} and the peak is very sharp, then

p(x \mid \omega_i, D) \approx p(x \mid \omega_i, \hat{\theta}_i)

and

P(\omega_i \mid x, D) \approx
\frac{p(x \mid \omega_i, \hat{\theta}_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, \hat{\theta}_j)\, P(\omega_j)}

Therefore, the ML estimate is justified.
Both approaches coincide if large amounts of data are available.
In small-sample-size problems they may or may not agree, depending on the form of the distributions
The ML method is typically easier to implement than the Bayesian one
For the mixture, the recursive form becomes:

p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta}
= \frac{\left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j) P(\omega_j)\right] p(\theta \mid D^{n-1})}
{\int \left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j) P(\omega_j)\right] p(\theta \mid D^{n-1})\, d\theta}
If we consider the case in which P(ω_1) = 1 and all other prior probabilities are zero, corresponding to the supervised case in which all samples come from class ω_1, then we get

p(\theta \mid D^n) = \frac{p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})}
{\int p(x_n \mid \omega_1, \theta_1)\, p(\theta \mid D^{n-1})\, d\theta}

compared with the unsupervised form from the previous slide:

p(\theta \mid D^n) = \frac{\left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j) P(\omega_j)\right] p(\theta \mid D^{n-1})}
{\int \left[\sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j) P(\omega_j)\right] p(\theta \mid D^{n-1})\, d\theta}
Comparing the two equations, we see how observing an additional sample changes the estimate of θ.
Ignoring the denominator, which is independent of θ, the only significant difference is that:
in the supervised case, we multiply the "prior" density for θ by the component density p(x_n | ω_1, θ_1)
in the unsupervised case, we multiply it by the whole mixture \sum_{j=1}^{c} p(x_n \mid \omega_j, \theta_j) P(\omega_j)
Assuming that the sample did come from class ω_1, the effect of not knowing this category is to diminish the influence of x_n in changing θ for category 1.
Data Clustering
Structures of multidimensional patterns are important for clustering
If we know that data come from a specific distribution, such data can be represented by a compact set of parameters (sufficient statistics)
If samples are assumed to come from a specific distribution but actually do not, these statistics give a misleading representation of the data
Approximation of density functions:
Mixtures of normal distributions can approximate arbitrary PDFs
In these cases, one can use parametric methods to estimate the parameters of the mixture density
No free lunch: the dimensionality issue remains!
Caveat
If little prior knowledge can be assumed, the assumption of a parametric form is meaningless:
Issue: imposing structure vs. finding structure
Use a nonparametric method to estimate the unknown mixture density.
Alternatively, for subclass discovery:
use a clustering procedure
identify data points having strong internal similarities
Similarity measures
What do we mean by similarity?
Two issues:
How to measure the similarity between samples?
How to evaluate a partitioning of a set into clusters?
The obvious measure of similarity/dissimilarity is the distance between samples
Samples in the same cluster should be closer to each other than to samples in different classes.
Euclidean distance is a possible metric:
assume samples belong to the same cluster if their distance is less than a threshold d_0
Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not invariant to general transformations that distort the distance relationships
Achieving invariance:
normalize the data, e.g., such that they all have zero mean and unit variance,
or use principal components for invariance to rotation
A broad class of metrics is the Minkowski metric, where q ≥ 1 is a selectable parameter:

d(x, x') = \left( \sum_{k=1}^{d} |x_k - x'_k|^q \right)^{1/q}

q = 1: Manhattan or city-block metric
q = 2: Euclidean metric
One can also use a nonmetric similarity function s(x, x') to compare two vectors.
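A minimal sketch of the Minkowski metric (the function name and example points are assumptions made here):

```python
def minkowski(x, y, q):
    """Minkowski metric d(x, x') = (sum_k |x_k - x'_k|^q)^(1/q)."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

a, b = (0.0, 0.0), (3.0, 4.0)
d1 = minkowski(a, b, 1)   # q = 1: city-block distance
d2 = minkowski(a, b, 2)   # q = 2: Euclidean distance
```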
It is typically a symmetric function whose value is large when x and x' are similar.
For example, the normalized inner product

s(x, x') = \frac{x^t x'}{\|x\|\,\|x'\|}

In the case of binary-valued features, we have, e.g.:

s(x, x') = \frac{x^t x'}{d}

and the Tanimoto coefficient

s(x, x') = \frac{x^t x'}{x^t x + x'^t x' - x^t x'}
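A small sketch of the Tanimoto coefficient for binary-valued feature vectors (illustrative; the function name is an assumption):

```python
def tanimoto(x, y):
    """Tanimoto coefficient: shared features over total distinct features."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

# Identical binary vectors have similarity 1; partially overlapping ones less.
s_same = tanimoto([1, 0, 1], [1, 0, 1])
s_half = tanimoto([1, 1, 0], [1, 0, 0])
```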
Clustering as optimization
The second issue: how to evaluate a partitioning of a set into clusters?
Clustering can be posed as an optimization of a criterion function
The sum-of-squared-error criterion and its variants
Scatter criteria
The sum-of-squared-error criterion
Let n_i be the number of samples in D_i, and m_i the mean of those samples:

m_i = \frac{1}{n_i} \sum_{x \in D_i} x
The sum of squared errors is defined as

J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2

This criterion defines clusters by their mean vectors m_i;
it minimizes the sum of the squared lengths of the errors x - m_i.
The minimum-variance partition minimizes J_e
Results:
Good when clusters form well-separated compact clouds
Bad with large differences in the number of samples in different clusters.
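The criterion can be computed directly; a small sketch (the function name and the sample clusters are assumptions made here):

```python
def sum_squared_error(clusters):
    """J_e = sum over clusters of squared distances to the cluster mean."""
    J = 0.0
    for D_i in clusters:
        n_i = len(D_i)
        m_i = tuple(sum(col) / n_i for col in zip(*D_i))   # cluster mean
        J += sum(sum((a - b) ** 2 for a, b in zip(x, m_i)) for x in D_i)
    return J

# Two tight clusters: each contributes 2.0 to J_e
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 2.0)]]
J_e = sum_squared_error(clusters)
```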
Scatter criteria
Scatter matrices are used as in multiple discriminant analysis, i.e., the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B:
S_T = S_B + S_W
Note:
S_T does not depend on the partitioning
In contrast, S_B and S_W depend on the partitioning
Two approaches:
minimize the within-cluster scatter
maximize the between-cluster scatter
The trace (sum of diagonal elements) is the simplest scalar measure of a scatter matrix;
it is proportional to the sum of the variances in the coordinate directions

tr\, S_W = \sum_{i=1}^{c} tr\, S_i = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - m_i\|^2 = J_e

This is the sum-of-squared-error criterion, J_e.
As tr[S_T] = tr[S_W] + tr[S_B] and tr[S_T] is independent of the partitioning, no new results can be derived by minimizing tr[S_B]
However, seeking to minimize the within-cluster criterion J_e = tr[S_W] is equivalent to maximizing the between-cluster criterion

tr\, S_B = \sum_{i=1}^{c} n_i \|m_i - m\|^2

where m is the total mean vector:

m = \frac{1}{n} \sum_{x \in D} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i
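The identity tr[S_T] = tr[S_W] + tr[S_B] can be checked numerically on a toy partition; this sketch computes the three traces as scalar sums of squared distances (the data are chosen arbitrarily for illustration):

```python
def sq(u, v):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(pts):
    """Component-wise mean of a list of points."""
    return tuple(sum(col) / len(pts) for col in zip(*pts))

clusters = [[(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)], [(5.0, 5.0), (6.0, 5.0)]]
allpts = [x for D_i in clusters for x in D_i]
m = mean(allpts)                                   # total mean vector

tr_SW = sum(sq(x, mean(D_i)) for D_i in clusters for x in D_i)
tr_SB = sum(len(D_i) * sq(mean(D_i), m) for D_i in clusters)
tr_ST = sum(sq(x, m) for x in allpts)
```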
Iterative optimization
Clustering is a discrete optimization problem
A finite data set has a finite number of partitions
What is the cost of exhaustive search?
Roughly c^n / c! for c clusters: not a good idea
Typically, iterative optimization is used:
start from a reasonable initial partition
redistribute samples to minimize the criterion function
This guarantees local, not global, optimization.
Consider an iterative procedure to minimize the sum-of-squared-error criterion J_e:

J_e = \sum_{i=1}^{c} J_i, \qquad J_i = \sum_{x \in D_i} \|x - m_i\|^2

where J_i is the effective error per cluster.
Moving a sample \hat{x} from cluster D_i to D_j changes the errors in the two clusters by:

J_j^* = J_j + \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2

J_i^* = J_i - \frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2
Hence, the transfer is advantageous if the decrease in J_i is larger than the increase in J_j:

\frac{n_i}{n_i - 1} \|\hat{x} - m_i\|^2 > \frac{n_j}{n_j + 1} \|\hat{x} - m_j\|^2
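The transfer test can be sketched directly (the helper name `transfer_gain` and the toy clusters are assumptions; the caller must ensure n_i > 1, i.e., never empty a singleton cluster):

```python
def transfer_gain(x, D_i, D_j):
    """Positive when moving x from D_i to D_j decreases J_e; requires len(D_i) > 1."""
    def mean(pts):
        return tuple(sum(col) / len(pts) for col in zip(*pts))
    def sq(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    n_i, n_j = len(D_i), len(D_j)
    decrease = n_i / (n_i - 1) * sq(x, mean(D_i))   # error removed from D_i
    increase = n_j / (n_j + 1) * sq(x, mean(D_j))   # error added to D_j
    return decrease - increase

D_i = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0)]   # (5, 0) sits far from this cluster
D_j = [(6.0, 0.0), (7.0, 0.0)]
# Moving the outlier (5, 0) toward D_j should be advantageous;
# moving the well-placed (0, 0) should not.
```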
Define a similarity matrix by thresholding a similarity measure at s_0:

s_{ij} = \begin{cases} 1 & \text{if } s(x_i, x_j) > s_0 \\ 0 & \text{otherwise} \end{cases}
This matrix induces a similarity graph, dual to S, in which nodes correspond to points and an edge joins nodes i and j iff s_ij = 1.
Single-linkage algorithm: two samples x and x' are in the same cluster if there exists a chain x, x_1, x_2, ..., x_k, x' such that x is similar to x_1, x_1 to x_2, and so on: clusters are the connected components of the graph
Complete-link algorithm: all samples in a given cluster must be similar to one another, and no sample can be in more than one cluster.
The nearest-neighbor algorithm is a method to find the minimum spanning tree, and vice versa
Removal of the longest edge produces a 2-cluster grouping, removal of the next longest edge produces a 3-cluster grouping, and so on.
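A sketch of the minimum-spanning-tree view (Prim's algorithm is a standard choice made here, not specified by the slides): the longest MST edge is the one whose removal splits the data into a 2-cluster grouping.

```python
def mst_edges(pts):
    """Edges (i, j, length) of a Euclidean minimum spanning tree via Prim."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    in_tree, edges = {0}, []
    while len(in_tree) < len(pts):
        # cheapest edge crossing from the tree to the remaining nodes
        i, j = min(((i, j) for i in in_tree for j in range(len(pts))
                    if j not in in_tree),
                   key=lambda e: dist(pts[e[0]], pts[e[1]]))
        in_tree.add(j)
        edges.append((i, j, dist(pts[i], pts[j])))
    return edges

# Two clumps: nodes 0-2 near the origin, nodes 3-4 near (9, 9)
pts = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0), (9.0, 9.0), (10.0, 9.0)]
edges = mst_edges(pts)
# Removing the longest edge separates the two clumps
longest = max(edges, key=lambda e: e[2])
```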
This is a divisive hierarchical procedure, and it suggests ways of dividing the graph into subgraphs
E.g., in selecting an edge to remove, compare its length with the lengths of the other edges incident on its nodes
One useful statistic to estimate from the minimal spanning tree is the edge-length distribution
For instance, in the case of two dense clusters immersed in a sparse set of points: