
9.913 Pattern Recognition for Vision

Class VII, Part I – Techniques for Clustering


Yuri Ivanov

Fall 2004
TOC

• Similarity metric
• K-means and IsoData algorithms
• EM algorithm
• Some hierarchical clustering schemes



Clustering

• Clustering is a process of partitioning the data into groups based on similarity

• Clusters are groups of measurements that are similar

• In Classification groups of similar data form classes


– Labels are given
– Similarity is deduced from labels

• In Clustering groups of similar data form clusters


– Similarity measure is given
– Labels are deduced from similarity



Clustering

Figure: classification (labels given) and clustering (labels deduced), each shown before and after scaling the axes (x → αx, y → βy).


Questions

• What is “similar”?
• What is a “good” partitioning?



Distances

Most obvious: distance between samples


• Compute distances between the samples
• Compare distances to a threshold

We need a metric to define distances and thresholds



Metric and Invariance

• We can choose the metric from a family:

d(x, x') = \left( \sum_{k=1}^{d} | x_k - x'_k |^q \right)^{1/q}   – Minkowski metric

q = 1 → Manhattan / city block / taxicab distance

q = 2 → Euclidean distance

d(x, x') is invariant to rotation and translation only for q = 2
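To make the family concrete, here is a minimal NumPy sketch (NumPy, the helper name, and the example points are my additions, not part of the slides) that computes the q = 1 and q = 2 distances and checks the rotation-invariance claim numerically:

```python
import numpy as np

def minkowski(x, xp, q):
    """Minkowski distance d(x, x') = (sum_k |x_k - x'_k|^q)^(1/q)."""
    return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

x, xp = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x, xp, 1))   # 7.0  (city block)
print(minkowski(x, xp, 2))   # 5.0  (Euclidean)

# Rotating both points changes the q = 1 distance but not the q = 2 distance.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(minkowski(R @ x, R @ xp, 1))   # no longer 7.0
print(minkowski(R @ x, R @ xp, 2))   # still 5.0
```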



Minkowski Metrics

Figure: points at distance 1 from the origin (the unit "circles") for the L1 (q = 1), L2, L3, L4, and L10 Minkowski metrics.


Metric and Invariance

Other choices for an invariant metric:

• We can use a data-driven metric:

d(x, x') = (x - x')^T \Sigma^{-1} (x - x')   – Mahalanobis distance

• We can normalize the data (whiten):

x' = \left( \Lambda^{-1/2} \Phi^T \right) x

(Λ, Φ: the eigenvalue and eigenvector matrices of the data covariance)

and then use the Euclidean metric
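The two choices agree: the Euclidean distance after whitening equals the (square-rooted) Mahalanobis distance. A small NumPy sketch as a sanity check (the synthetic data and helper names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])  # correlated data

S = np.cov(X, rowvar=False)          # sample covariance
S_inv = np.linalg.inv(S)

def mahalanobis(x, xp):
    """sqrt((x - x')^T S^{-1} (x - x')), the square root of the slide's quantity."""
    d = x - xp
    return np.sqrt(d @ S_inv @ d)

# Whitening: x' = Lambda^{-1/2} Phi^T x, with S = Phi Lambda Phi^T
lam, Phi = np.linalg.eigh(S)
W = np.diag(lam ** -0.5) @ Phi.T

x, xp = X[0], X[1]
print(mahalanobis(x, xp))                 # Mahalanobis distance
print(np.linalg.norm(W @ x - W @ xp))     # Euclidean distance after whitening (same value)
```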



Metric

Euclidean metric
• Good for isotropic spaces
• Bad for linear transformations (except rotation and translation)

Mahalanobis metric:
• Good if there is enough data

Whitening:
• Good if the spread is due to random processes
• Bad if it is due to subclasses



Similarity

We need a symmetric function that is large for “similar” x

E.g.:  s(x, x') = \frac{x^T x'}{\| x \| \, \| x' \|}   – "angular" similarity
Vocabulary:

{Two, three, little, star, monkeys, jumping, twinkle, bed}

a) Three little monkeys jumping on the bed (0, 1, 1, 0, 1, 1, 0, 1)

b) Two little monkeys jumping on the bed (1, 0, 1, 0, 1, 1, 0, 1)


c) Twinkle twinkle little star (0, 0, 1, 1, 0, 0, 2, 0)
Similarity matrix:

        a      b      c
  a    1.0    0.8    0.18
  b    0.8    1.0    0.18
  c    0.18   0.18   1.0
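The matrix above can be reproduced in a few lines of NumPy (a sketch using the word-count vectors listed above; NumPy and the helper name are my choices):

```python
import numpy as np

# Counts over the vocabulary {two, three, little, star, monkeys, jumping, twinkle, bed}
a = np.array([0, 1, 1, 0, 1, 1, 0, 1])  # "Three little monkeys jumping on the bed"
b = np.array([1, 0, 1, 0, 1, 1, 0, 1])  # "Two little monkeys jumping on the bed"
c = np.array([0, 0, 1, 1, 0, 0, 2, 0])  # "Twinkle twinkle little star"

def angular_similarity(x, xp):
    """s(x, x') = x^T x' / (||x|| ||x'||)."""
    return x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))

V = [a, b, c]
S = np.array([[angular_similarity(u, v) for v in V] for u in V])
print(np.round(S, 2))   # [[1.   0.8  0.18] [0.8  1.   0.18] [0.18 0.18 1.  ]]
```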
Similarity

It doesn’t have to be a metric:

E.g.:          Has fur   Has 4 legs   Can type
  Monkey          1          0           1
  Platypus        1          1           0

s(x, x') = \frac{x^T x'}{d}:

              Monkey   Platypus
  Monkey        .67       .33
  Platypus      .33       .67

s(x, x') = \frac{x^T x'}{x^T x + x'^T x' - x^T x'}   – Tanimoto coefficient:

              Monkey   Platypus
  Monkey        1         .33
  Platypus      .33       1
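A short NumPy sketch of the two similarities on the Monkey/Platypus feature vectors (the helper names are mine, for illustration):

```python
import numpy as np

monkey   = np.array([1, 0, 1])   # has fur, not 4-legged, can type
platypus = np.array([1, 1, 0])   # has fur, 4-legged, cannot type

def dot_similarity(x, xp):
    """s(x, x') = x^T x' / d, the fraction of shared 'on' features."""
    return x @ xp / len(x)

def tanimoto(x, xp):
    """s(x, x') = x^T x' / (x^T x + x'^T x' - x^T x')."""
    return x @ xp / (x @ x + xp @ xp - x @ xp)

print(dot_similarity(monkey, platypus))  # 0.33...
print(tanimoto(monkey, platypus))        # 0.33...
print(tanimoto(monkey, monkey))          # 1.0
```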
Partitioning Evaluation

J – objective function, s.t. clustering is assumed optimal when J is minimized or maximized

J = \sum_{k=1}^{K} \sum_{n=1}^{N_k} \| x_n^{(k)} - m_k \|^2   – sum-of-squared-error criterion (min)

Using the definition of the mean:

J = \frac{1}{2} \sum_{k=1}^{K} N_k \left[ \frac{1}{N_k^2} \sum_{n=1}^{N_k} \sum_{m=1}^{N_k} \| x_n^{(k)} - x_m^{(k)} \|^2 \right]

The bracketed term is the dissimilarity measure – you can replace it with your favorite.
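The identity above is easy to check numerically. An illustrative NumPy sketch (synthetic data, not from the slides) that computes J both ways and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(1)
clusters = [rng.normal(0, 1, size=(30, 2)), rng.normal(5, 1, size=(40, 2))]  # two toy clusters

# Sum-of-squared-error form: squared distances to each cluster mean
J_sse = sum(np.sum((X - X.mean(axis=0)) ** 2) for X in clusters)

# Pairwise form: (1/2) * sum_k N_k * [ (1/N_k^2) * sum_{n,m} ||x_n - x_m||^2 ]
J_pair = 0.0
for X in clusters:
    Nk = len(X)
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise squared distances
    J_pair += 0.5 * Nk * (D2.sum() / Nk ** 2)

print(np.isclose(J_sse, J_pair))   # True
```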
Partitioning Evaluation

Other possibilities:

For the within- and between-cluster scatter matrices S_W and S_B (recall LDA):

J = | S_W | = \left| \sum_{k=1}^{K} S_k \right|   – scatter determinant criterion (min)

J = \mathrm{tr}\left( S_W^{-1} S_B \right) = \sum_{i=1}^{d} \lambda_i   – scatter ratio criterion (max), with λ_i the eigenvalues of S_W^{-1} S_B

Careful with the ranks!


Which to choose?

• No methodological answer

• SSE criterion (minimum variance)


– simple
– good for well separated clusters in dense groups
– affected by outliers, scale variant

• Scatter criteria
– Invariant to general linear transformations
– Poor when the amount of data is small relative to the dimensionality

• You should choose the metric and the objective that are invariant to the transformations natural to your problem


Clustering

x – input data

K – number of clusters (assumed known)

Nk – number of points in cluster k

N – total number of data points

tk – prototype (template) vector of k-th cluster

J – objective function, s.t. clustering is assumed optimal when J is extremized


General Procedure

Clustering is usually an iterative procedure:

• Choose initial configuration


• Adjust configuration s.t. J is optimized
• Check for convergence

J is often only partially minimized.



Clustering – A Good Start

Let’s choose the following model:

• Known number of clusters


• Each cluster is represented by a single prototype
• Similarity is defined in the nearest neighbor sense
Sum-Squared-Error objective:
J = \sum_{k=1}^{K} \sum_{n=1}^{N_k} \| x_n^{(k)} - t_k \|^2   – total in-cluster distance over all clusters

\frac{dJ}{dt_k} = \frac{d}{dt_k} \sum_{c=1}^{K} \sum_{n=1}^{N_c} \| x_n^{(c)} - t_c \|^2 = -2 \sum_{n=1}^{N_k} \left( x_n^{(k)} - t_k \right) = 0 \;\Rightarrow\; t_k = \frac{1}{N_k} \sum_{n=1}^{N_k} x_n^{(k)}
K-Means Algorithm

Using the iterative procedure:

1. Choose K random positions for the prototypes


2. Classify all samples by the nearest tk
3. Compute new prototype positions
4. If not converged (i.e., if any cluster assignment changed from the previous iteration), go to step 2

This is the K-Means (a.k.a. Lloyd’s, a.k.a. LBG) algorithm.

What to do with empty clusters? Some heuristics are involved.
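A minimal NumPy sketch of steps 1–4 above (the data-point initialization, the empty-cluster heuristic, and the synthetic data are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, K, rng=np.random.default_rng(0), max_iter=100):
    # 1. Choose K random data points as the initial prototypes
    t = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # 2. Classify all samples by the nearest prototype t_k
        d2 = ((X[:, None, :] - t[None, :, :]) ** 2).sum(axis=-1)
        new_labels = d2.argmin(axis=1)
        # 4. Converged when no cluster assignment changed
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Compute new prototype positions (mean of assigned points);
        #    reinitialize an empty cluster at a random sample (one possible heuristic)
        for k in range(K):
            members = X[labels == k]
            t[k] = members.mean(axis=0) if len(members) else X[rng.integers(len(X))]
    return t, labels

# Usage on synthetic data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in [(0, 0), (4, 0), (2, 3)]])
prototypes, labels = kmeans(X, K=3)
print(prototypes)
```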



K-Means Algorithm Example

K = 10



Cluster Heuristics

Sometimes clusters end up empty. We can:


• Remove them
• Randomly reinitialize them
• Split the largest ones

Sometimes we have too many clusters. We can:


• Remove the smallest ones
• Relocate the smallest ones
• Merge the smallest ones together if they are neighbors



IsoData Algorithm

In K-Means we assume that we know the number of clusters

IsoData tries to estimate it – the ultimate K-Means hack

IsoData iterates between 3 stages:


• Center estimation
• Cluster splitting
• Cluster merging

The user specifies:


T – min number of samples in a cluster
N_D – desired number of clusters
σ_S² – maximum cluster variance
D_m – max distance for merging
N_max – max number of merges
IsoData

Stage I – Cluster assignment:

1. Assign a label to each data point such that:

   \omega_n = \arg\min_j \| x^n - t_j \|

2. Discard clusters with N_k < T and reduce N_c accordingly

3. Update the means of the remaining clusters:

   t_j = \frac{1}{N_j} \sum_{i=1}^{N_j} x_i^{(j)}

This is basically one step of the K-Means algorithm.
IsoData

Stage II – Cluster splitting:


1. If this is the last iteration, set D_m = 0 and go to Stage III
2. If N_c <= N_D / 2, go to splitting (step 4)
3. If the iteration is even, or if N_c >= 2 N_D, go to Stage III
4. Compute:

   d_k = \frac{1}{N_k} \sum_{i=1}^{N_k} \| x_i^{(k)} - t_k \|   – avg. distance from the center

   \sigma_k^2 = \max_j \frac{1}{N_k} \sum_{i=1}^{N_k} \left( x_{i,j}^{(k)} - t_{k,j} \right)^2   – max variance along a single dimension

   \bar{d} = \frac{1}{N} \sum_{k=1}^{N_c} N_k d_k   – overall avg. distance from the centers


IsoData

Stage II – Cluster splitting (cont.):

5. For clusters with \sigma_k^2 > \sigma_S^2:

   If ( d_k > \bar{d}  AND  N_k > 2(T+1) )  OR  N_c < N_D / 2:

   split the cluster by creating a new mean

   t'_{k,j} = t_{k,j} + 0.5\,\sigma_k^2

   and moving the old one to

   t_{k,j} = t_{k,j} - 0.5\,\sigma_k^2


IsoData

Stage III – Cluster merging:

If no split has been made:


1. Compute the matrix of distances between cluster centers:

   D_{i,j} = \| t_i - t_j \|

2. Make the list of pairs with D_{i,j} < D_m

3. Sort them in ascending order

4. Merge up to N_max unique pairs, starting from the top, by removing t_j and replacing t_i with:

   t_i = \frac{1}{N_i + N_j} \left( N_i t_i + N_j t_j \right)
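A sketch of this merging stage only, under the assumptions stated above (pairwise center distances, pairs below D_m sorted in ascending order, at most N_max merges of unique pairs, count-weighted means); the function name and array layout are mine, for illustration:

```python
import numpy as np

def merge_clusters(centers, counts, D_m, N_max):
    """One IsoData-style merging pass over cluster centers (centers: (Nc, d), counts: (Nc,))."""
    Nc = len(centers)
    # Steps 1-2: pairwise distances between centers; keep pairs closer than D_m
    pairs = [(np.linalg.norm(centers[i] - centers[j]), i, j)
             for i in range(Nc) for j in range(i + 1, Nc)
             if np.linalg.norm(centers[i] - centers[j]) < D_m]
    # Step 3: sort in ascending order of distance
    pairs.sort()
    merged, used = [], set()
    for dist, i, j in pairs:
        # Step 4: merge up to N_max unique pairs (each cluster used at most once)
        if len(merged) >= N_max or i in used or j in used:
            continue
        used.update((i, j))
        new_center = (counts[i] * centers[i] + counts[j] * centers[j]) / (counts[i] + counts[j])
        merged.append((new_center, counts[i] + counts[j]))
    keep = [k for k in range(Nc) if k not in used]
    new_centers = [centers[k] for k in keep] + [c for c, _ in merged]
    new_counts = [counts[k] for k in keep] + [n for _, n in merged]
    return np.array(new_centers), np.array(new_counts)
```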



IsoData Example

N_D = 10
T = 10
σ_S² = 3
D_m = 2
N_max = 3


Mixture Density Model

Mixture model – a linear combination of parametric densities

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j)

where M is the number of components, p(x|j) is the component density, and P(j) is the component weight, with

P(j) \ge 0 \;\; \forall j \quad \text{and} \quad \sum_{j=1}^{M} P(j) = 1

Recall Kernel density estimation


Kernels are parametric densities, subject to estimation
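A minimal sketch of evaluating such a mixture with Gaussian components (the weights, means, and standard deviations below are made up for illustration, not from the slides):

```python
import numpy as np

# Illustrative 1-D Gaussian mixture: component weights P(j), means, and std devs
weights = np.array([0.3, 0.5, 0.2])      # P(j) >= 0, sums to 1
means   = np.array([-2.0, 0.0, 3.0])
stds    = np.array([0.5, 1.0, 0.8])

def gauss_pdf(x, m, s):
    """Gaussian component density p(x | j)."""
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def mixture_pdf(x):
    """p(x) = sum_j p(x | j) P(j)."""
    return sum(P_j * gauss_pdf(x, m, s) for P_j, m, s in zip(weights, means, stds))

xs = np.linspace(-5.0, 6.0, 5)
print(mixture_pdf(xs))   # mixture density evaluated at a few points
```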
Example

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j)


Mixture Density

Using the ML principle, the objective function is the log-likelihood:

l(\theta) \equiv \ln \left( \prod_{n=1}^{N} p(x^n) \right) = \sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} p(x^n \mid j)\, P(j) \right)

Differentiate w.r.t. the parameters:

\frac{\partial}{\partial \theta_j}\, l(\theta) = \sum_{n=1}^{N} \frac{\partial}{\partial \theta_j} \ln \left( \sum_{k=1}^{M} p(x^n \mid k)\, P(k) \right) = \sum_{n=1}^{N} \frac{1}{\sum_{k=1}^{M} p(x^n \mid k)\, P(k)}\, \frac{\partial}{\partial \theta_j} \left[ p(x^n \mid j)\, P(j) \right]

The factor 1 / \sum_k p(x^n \mid k) P(k) appears here because of the log.
Mixture Density

For distributions p(x|j) in the exponential family:

\frac{\partial}{\partial \theta} \left[ A(\theta)\, e^{B(\theta, x)} \right] = A(\theta)\, e^{B(\theta, x)}\, \frac{\partial}{\partial \theta} B(\theta, x) + \frac{\partial A(\theta)}{\partial \theta}\, e^{B(\theta, x)}

Each of these pieces goes into the derivative of the log-likelihood, which then takes the form

\frac{\partial l(\theta)}{\partial \theta} = \sum_{n=1}^{N} P(j \mid x^n) \cdot \left( \text{Stuff} + \text{More Stuff} \right)

For a Gaussian:

\frac{\partial l(\theta)}{\partial \mu_j} = \sum_{n=1}^{N} P(j \mid x^n) \left[ \hat{\Sigma}_j^{-1} \left( x^n - \hat{\mu}_j \right) \right]

\frac{\partial l(\theta)}{\partial \hat{\Sigma}_j} = \sum_{n=1}^{N} P(j \mid x^n) \left[ \hat{\Sigma}_j^{-1} - \hat{\Sigma}_j^{-1} \left( x^n - \hat{\mu}_j \right)\left( x^n - \hat{\mu}_j \right)^T \hat{\Sigma}_j^{-1} \right]


Mixture Density

At the extremum of the objective:

P(j) = \frac{1}{N} \sum_{n=1}^{N} P(j \mid x^n)

\hat{\mu}_j = \frac{ \sum_{n=1}^{N} P(j \mid x^n)\, x^n }{ \sum_{n=1}^{N} P(j \mid x^n) }

\hat{\Sigma}_j = \frac{ \sum_{n=1}^{N} P(j \mid x^n)\, \left( x^n - \hat{\mu}_j \right)\left( x^n - \hat{\mu}_j \right)^T }{ \sum_{n=1}^{N} P(j \mid x^n) }

BUT:

P(j \mid x^n) = \frac{ p(x^n \mid j)\, P(j) }{ \sum_{k=1}^{M} p(x^n \mid k)\, P(k) }   – the parameters are tied

Solution – EM algorithm.
EM Algorithm

Suppose we pick an initial configuration (just like in K-Means)

Recall the objective (change of sign):

E \equiv -l(\theta) = -\ln \left( \prod_{n=1}^{N} p(x^n) \right) = -\sum_{n=1}^{N} \ln \left\{ p(x^n) \right\}

After a single step of optimization:

E^{new} - E^{old} = -\sum_{n=1}^{N} \ln \left( \frac{ p^{new}(x^n) }{ p^{old}(x^n) } \right) = -\sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} \frac{ P^{new}(j)\, p^{new}(x^n \mid j) }{ p^{old}(x^n) } \right)


EM Algorithm

After the optimization step:

E^{new} - E^{old} = -\sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} \frac{ P^{new}(j)\, p^{new}(x^n \mid j) }{ p^{old}(x^n) } \right)

Multiply and divide by P^{old}(j \mid x^n) (the inserted ratio equals 1):

= -\sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} \left[ \frac{ P^{new}(j)\, p^{new}(x^n \mid j)\, P^{old}(j \mid x^n) }{ p^{old}(x^n)\, P^{old}(j \mid x^n) } \right] \right)

= -\sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \frac{ P^{new}(j)\, p^{new}(x^n \mid j) }{ p^{old}(x^n)\, P^{old}(j \mid x^n) } \right)

Since P^{old}(j \mid x^n) sums to 1 over j, the inner sum has the form \ln \left( \sum_{j=1}^{M} \lambda_j\, y_j \right).
Digression - Convexity

Definition: a function f is convex on [a, b] iff for any x_1, x_2 in [a, b] and any λ in [0, 1]:

f\left( \lambda x_1 + (1 - \lambda) x_2 \right) \le \lambda f(x_1) + (1 - \lambda) f(x_2)

Figure: for a convex f, the chord value λf(x₁) + (1-λ)f(x₂) lies above the function value f(λx₁ + (1-λ)x₂).


Digression - Jensen’s Inequality

If f is a convex function:

f\left( \sum_{j=1}^{M} \lambda_j x_j \right) \le \sum_{j=1}^{M} \lambda_j f(x_j), \qquad \forall \lambda : 0 \le \lambda_j \le 1,\; \sum_j \lambda_j = 1

Equivalently:  f( E[x] ) \le E[ f(x) ]

Or:  f\left( \frac{1}{M} \sum_{j=1}^{M} x_j \right) \le \frac{1}{M} \sum_{j=1}^{M} f(x_j)

Flip the inequality if f is concave.
Digression - Jensen’s Inequality

Proof by induction:

a) JE is trivially true for any 2 points (definition of convexity)

b) Assuming it is true for any k-1 points, define \lambda_i^* \equiv \lambda_i / (1 - \lambda_k). Then:

\sum_{i=1}^{k} \lambda_i f(x_i) = \lambda_k f(x_k) + (1 - \lambda_k) \sum_{i=1}^{k-1} \lambda_i^* f(x_i)

\ge \lambda_k f(x_k) + (1 - \lambda_k)\, f\left( \sum_{i=1}^{k-1} \lambda_i^* x_i \right)

\ge f\left( \lambda_k x_k + (1 - \lambda_k) \sum_{i=1}^{k-1} \lambda_i^* x_i \right) = f\left( \sum_{i=1}^{k} \lambda_i x_i \right)

End of digression
Back to EM

Change in the error:

E^{new} - E^{old} = -\sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \frac{ P^{new}(j)\, p^{new}(x^n \mid j) }{ p^{old}(x^n)\, P^{old}(j \mid x^n) } \right)

This has the form \ln\left( \sum_j \lambda_j y_j \right) with \lambda_j = P^{old}(j \mid x^n), so by Jensen's inequality (ln is concave, giving \sum_j \lambda_j \ln\{ y_j \} as the bound):

\le -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \ln \left( \frac{ P^{new}(j)\, p^{new}(x^n \mid j) }{ p^{old}(x^n)\, P^{old}(j \mid x^n) } \right)


Back to EM

Call the right-hand side of this bound "Q":

Q \equiv -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \ln \left( \frac{ P^{new}(j)\, p^{new}(x^n \mid j) }{ p^{old}(x^n)\, P^{old}(j \mid x^n) } \right)

so that E^{new} - E^{old} \le Q.
EM as Upper Bound Minimization

Then:  E^{new} \le E^{old} + Q   – an upper bound on E^{new}(\theta^{new})

Some observations:
• Q is convex
• Q is a function of the new parameters θ^new
• So is E^new
• If θ^new = θ^old then E^new = E^old + Q

Figure: E(θ^new) and the bound E^old + Q(θ^new) touch at θ^old; a step downhill in Q leads downhill in E^new!!!


EM Iteration

Figure: one EM iteration. Given the initial θ, minimize the bound E^old + Q(θ^new) to obtain θ^new; then compute the new bound E^old + Q around θ^new and repeat.
EM (cont.)

Q = -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \ln \left( \frac{ P^{new}(j)\, p^{new}(x^n \mid j) }{ p^{old}(x^n)\, P^{old}(j \mid x^n) } \right)

The denominator terms do not depend on the new parameters, so we can drop them:

\tilde{Q} = -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \ln \left\{ P^{new}(j)\, p^{new}(x^n \mid j) \right\}

For a Gaussian mixture:

\tilde{Q} = -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \left[ \ln P^{new}(j) + \ln \left( G_j(x^n) \right) \right]

As before – differentiate, set to 0, solve for each parameter.


EM (cont.)

Straightforward for the means and covariances:

\hat{\mu}_j = \frac{ \sum_{n=1}^{N} P^{old}(j \mid x^n)\, x^n }{ \sum_{n=1}^{N} P^{old}(j \mid x^n) }   – a convex sum, weighted w.r.t. the previous estimate

\hat{\Sigma}_j = \frac{ \sum_{n=1}^{N} P^{old}(j \mid x^n)\, \left( x^n - \hat{\mu}_j \right)\left( x^n - \hat{\mu}_j \right)^T }{ \sum_{n=1}^{N} P^{old}(j \mid x^n) }   – a convex sum, weighted w.r.t. the previous estimate


EM (cont.)

Need to enforce the sum-to-one constraint on P(j):

J_P = \tilde{Q} + \lambda \left( \sum_{j=1}^{M} P^{new}(j) - 1 \right)

\frac{\partial J_P}{\partial P^{new}(j)} = -\sum_{n=1}^{N} \frac{ P^{old}(j \mid x^n) }{ P^{new}(j) } + \lambda = 0

\Rightarrow\; \lambda\, P^{new}(j) = \sum_{n=1}^{N} P^{old}(j \mid x^n)

\Rightarrow\; \lambda \sum_{j=1}^{M} P^{new}(j) = \sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n)

\Rightarrow\; \lambda = N \quad\Rightarrow\quad P^{new}(j) = \frac{1}{N} \sum_{n=1}^{N} P^{old}(j \mid x^n)
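Putting the tied responsibilities P^old(j|x^n) (E-step) together with the weighted mean, covariance, and prior updates (M-step) gives the usual EM loop for a Gaussian mixture. The sketch below is a 1-D illustration; the initialization, iteration count, and synthetic data are my choices, not prescribed by the slides:

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(x, M, n_iter=100, rng=np.random.default_rng(0)):
    """EM for a 1-D Gaussian mixture with M components (illustrative sketch)."""
    N = len(x)
    P = np.full(M, 1.0 / M)                       # P(j)
    mu = rng.choice(x, size=M, replace=False)     # initial means: random data points
    var = np.full(M, x.var())                     # initial variances
    for _ in range(n_iter):
        # E-step: P(j | x^n) = p(x^n | j) P(j) / sum_k p(x^n | k) P(k)
        lik = gauss_pdf(x[:, None], mu[None, :], var[None, :]) * P[None, :]   # (N, M)
        R = lik / lik.sum(axis=1, keepdims=True)
        # M-step: convex sums weighted by the old responsibilities
        Nj = R.sum(axis=0)
        mu = (R * x[:, None]).sum(axis=0) / Nj
        var = (R * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nj
        P = Nj / N
    return P, mu, var

# Usage on synthetic data drawn from two Gaussians
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])
print(em_gmm(x, M=2))
```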
EM Example

N_c = 3


EM Illustration

P(j|x) tells how much the data point affects each cluster, unlike in K-Means.

You can manipulate P(j|x), e.g. for partially labeled data.

Figure: responsibilities P(j = 1|x), P(j = 2|x), P(j = 3|x) around the means m1, m2, m3; a second panel contrasts a labeled point with an unlabeled one.


EM vs K-Means

Furthermore, P(j|x) can be replaced with:

\tilde{P}(j \mid x) = \frac{ P(j \mid x)\, e^{\gamma P(j \mid x)} }{ \sum_k P(k \mid x)\, e^{\gamma P(k \mid x)} }

If \gamma = 0, then \tilde{P}(j \mid x) = P(j \mid x).

Now let's relax γ:

\lim_{\gamma \to \infty} \tilde{P}(j \mid x) = \delta\left( P(j \mid x), \max_j P(j \mid x) \right)

This is K-Means!!!
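A small sketch of the sharpened responsibilities above, showing that γ = 0 leaves P(j|x) unchanged while large γ drives it toward a hard, K-Means-style assignment (the example probabilities are illustrative):

```python
import numpy as np

def sharpen(P, gamma):
    """P_tilde(j|x) = P(j|x) exp(gamma P(j|x)) / sum_k P(k|x) exp(gamma P(k|x))."""
    w = P * np.exp(gamma * P)
    return w / w.sum()

P = np.array([0.5, 0.3, 0.2])        # soft responsibilities for one point
print(sharpen(P, 0.0))               # gamma = 0: unchanged, [0.5 0.3 0.2]
print(sharpen(P, 10.0))              # larger gamma: mass concentrates on the argmax
print(sharpen(P, 100.0))             # ~[1, 0, 0], i.e. the hard K-Means assignment
```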
Hierarchical Clustering

Ex: a dendrogram over samples x1 … x6, with dissimilarity on the vertical axis; thresholds (e.g. T = 2, T = 3) cut it at different levels.

There are 2 ways to do it:
• Agglomerative (bottom-up)
• Divisive (top-down)

Different thresholds induce different cluster configurations.

Stopping criterion: either a number of clusters or a distance threshold.


Hierarchical Agglomerative Clustering

General structure:

Initialize: K, \hat{K} \leftarrow N, D_n \leftarrow \{ x_n \}, n = 1..N
do    \hat{K} \leftarrow \hat{K} - 1
      i, j = \arg\min_{l,m} d(D_l, D_m)
      merge(D_i, D_j)
until \hat{K} == K

Need to specify the cluster distance d; each choice induces a different algorithm, e.g.:

d = d_{mean}(D_i, D_j) = \| m_i - m_j \|

d = d_{min}(D_i, D_j) = \min_{x_1 \in D_i,\, x_2 \in D_j} \| x_1 - x_2 \|

d = d_{max}(D_i, D_j) = \max_{x_1 \in D_i,\, x_2 \in D_j} \| x_1 - x_2 \|
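A minimal sketch of this loop with d_min and d_max as interchangeable cluster distances (the brute-force implementation and the synthetic data are illustrative only):

```python
import numpy as np
from itertools import combinations

def d_min(Di, Dj):   # single linkage
    return min(np.linalg.norm(a - b) for a in Di for b in Dj)

def d_max(Di, Dj):   # complete linkage
    return max(np.linalg.norm(a - b) for a in Di for b in Dj)

def agglomerative(X, K, cluster_dist=d_min):
    """Merge the two closest clusters until only K remain."""
    clusters = [[x] for x in X]                 # K_hat <- N, D_n <- {x_n}
    while len(clusters) > K:                    # do ... until K_hat == K
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        clusters[i].extend(clusters[j])         # merge(D_i, D_j)
        del clusters[j]
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (12, 2))])
print([len(c) for c in agglomerative(X, K=2)])                          # e.g. [10, 12]
print([len(c) for c in agglomerative(X, K=2, cluster_dist=d_max)])
```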
Single Linkage Algorithm

Choosing d = d_min results in a Nearest Neighbor Algorithm (a.k.a. single linkage algorithm, a.k.a. minimum algorithm).

Example with N = 2 clusters.
Each cluster is a minimal spanning tree of the data in the cluster.
Identifies clusters that are well separated.
Complete Linkage Algorithm

Choosing d = d_max results in a Farthest Neighbor Algorithm (a.k.a. complete linkage algorithm, a.k.a. maximum algorithm).

Example with N = 2 clusters.
Each cluster is a complete subgraph of the data.
Identifies clusters that are well localized.
Summary

• General concerns about choice of similarity metric


• K-means algorithm – simple but relies on Euclidean distances
• IsoData – old-school step towards model selection
• EM – “statistician’s K-means” – simple, general and convenient
• Some hierarchical clustering schemes

