Pattern Recognition for Vision, Fall 2004
TOC
• Similarity metric
• K-means and IsoData algorithms
• EM algorithm
• Some hierarchical clustering schemes
[Figure: side-by-side comparison. Classification: labels are given. Clustering: labels must be deduced. Both panels show the data under an axis scaling x → αx, y → βy.]
• What is “similar”?
• What is a “good” partitioning?
$$d(x, x') = \left( \sum_{k=1}^{d} \left| x_k - x'_k \right|^q \right)^{1/q}$$ - Minkowski metric
[Figure: unit circles of the Minkowski metric for L1 (q = 1), L2, L3, L4, L10, and L∞.]
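For illustration only (not from the slides), a minimal NumPy sketch of the Minkowski metric; `minkowski` is a hypothetical helper name:

```python
import numpy as np

def minkowski(x, xp, q=2):
    """Minkowski distance: (sum_k |x_k - x'_k|^q)^(1/q)."""
    return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

x, xp = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, xp, q=1))  # L1 (city block): 7.0
print(minkowski(x, xp, q=2))  # L2 (Euclidean):  5.0
```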
$$d(x, x') = (x - x')^T \Sigma^{-1} (x - x')$$ - Mahalanobis distance
Whitening: transform the data by

$$x' = \left( \Lambda^{-1/2} \Phi^T \right) x$$

and then use the Euclidean metric ($\Phi$, $\Lambda$: eigenvector and eigenvalue matrices of the covariance $\Sigma$).
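For illustration, a minimal NumPy sketch of this whitening step, assuming $\Sigma$ is the sample covariance of the data (`whiten` is my name for it, not the course's):

```python
import numpy as np

def whiten(X):
    """Map each row x to x' = Lambda^(-1/2) Phi^T x, so that the
    whitened data has (approximately) identity covariance."""
    Sigma = np.cov(X, rowvar=False)       # estimated covariance
    lam, Phi = np.linalg.eigh(Sigma)      # Lambda (eigenvalues), Phi (eigenvectors)
    W = np.diag(lam ** -0.5) @ Phi.T      # Lambda^(-1/2) Phi^T
    return X @ W.T

X = np.random.randn(500, 2) @ np.array([[2.0, 0.0], [1.0, 0.5]])
Xw = whiten(X)
print(np.cov(Xw, rowvar=False))           # ~ identity matrix
```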
Euclidean metric:
• Good for isotropic spaces
• Bad under general linear transformations (invariant only to rotation and translation)
Mahalanobis metric:
• Good if there is enough data to estimate $\Sigma$ reliably
Whitening:
• Good if the spread is due to random processes
• Bad if it is due to subclasses
E.g.:

$$s(x, x') = \frac{x^T x'}{\left\| x \right\| \left\| x' \right\|}$$ - "angular" similarity
Vocabulary - similarity matrix, e.g. for three points a, b, c:

      a     b     c
a    1.0   0.8   0.18
b    0.8   1.0   0.18
c    0.18  0.18  1.0
Similarity

For binary feature vectors:

$$s(x, x') = \frac{x^T x'}{d}$$ e.g. similarity matrix $\begin{pmatrix} .67 & .33 \\ .33 & .67 \end{pmatrix}$

$$s(x, x') = \frac{x^T x'}{x^T x + x'^T x' - x^T x'}$$ - Tanimoto coefficient, e.g. similarity matrix $\begin{pmatrix} 1 & .33 \\ .33 & 1 \end{pmatrix}$
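A small sketch of these two similarity measures for binary feature vectors (`sim_shared` and `tanimoto` are my hypothetical helper names):

```python
import numpy as np

def sim_shared(x, xp):
    """Fraction of the d attributes on which x and x' share a 1."""
    return (x @ xp) / len(x)

def tanimoto(x, xp):
    """Tanimoto coefficient: shared 1s over total distinct 1s."""
    return (x @ xp) / (x @ x + xp @ xp - x @ xp)

x  = np.array([1, 1, 0, 1, 0, 0])
xp = np.array([1, 0, 0, 1, 1, 0])
print(sim_shared(x, xp))  # 2/6 ~= 0.33
print(tanimoto(x, xp))    # 2/(3+3-2) = 0.5
```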
Partitioning Evaluation
$$J = \sum_{k=1}^{K} \sum_{n=1}^{N_k} \left\| x_n^{(k)} - m_k \right\|^2$$ - sum-of-squared-error criterion (min)

The squared distance here is the dissimilarity measure; you can replace it with your favorite.
Partitioning Evaluation
Other possibilities:
$$J = \left| S_W \right| = \left| \sum_{k=1}^{K} S_k \right|$$ - scatter determinant criterion (min)

$$J = \operatorname{tr}\left( S_W^{-1} S_B \right) = \sum_{i=1}^{d} \lambda_i$$ - scatter ratio criterion (max)
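As a concrete illustration (my own sketch, not course code; `scatter_criteria` is a hypothetical helper), computing both scatter criteria for a given labeling:

```python
import numpy as np

def scatter_criteria(X, labels):
    """Return (|S_W|, tr(S_W^-1 S_B)) for a given partitioning."""
    m = X.mean(axis=0)                    # total mean
    d = X.shape[1]
    S_W = np.zeros((d, d))                # within-cluster scatter
    S_B = np.zeros((d, d))                # between-cluster scatter
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)    # S_k, summed over clusters
        S_B += len(Xk) * np.outer(mk - m, mk - m)
    det_SW = np.linalg.det(S_W)                   # minimize
    ratio = np.trace(np.linalg.solve(S_W, S_B))   # maximize (= sum of eigenvalues)
    return det_SW, ratio

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
labels = np.array([0] * 50 + [1] * 50)
print(scatter_criteria(X, labels))
```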
• There is no methodological answer to which criterion is best
• Scatter criteria:
  – Invariant to general linear transformations
  – Poor on small amounts of data relative to the dimensionality
• Choose a metric and an objective that are invariant to the transformations natural to your problem
x - input data. Minimizing J with respect to the cluster centers $t_k$:

$$\frac{dJ}{dt_k} = \frac{d}{dt_k} \sum_{c=1}^{K} \sum_{n=1}^{N_c} \left\| x_n^{(c)} - t_c \right\|^2 = -2 \sum_{n=1}^{N_k} \left( x_n^{(k)} - t_k \right) = 0 \;\Rightarrow\; t_k = \frac{1}{N_k} \sum_{n=1}^{N_k} x_n^{(k)}$$
K-Means Algorithm

[Figure: example K-means run with K = 10.]

IsoData extends K-means with splitting and merging steps; the merging step:
1. Compute the distances between cluster centers: $D_{i,j} = \left\| t_i - t_j \right\|$
2. Make the list of pairs where $D_{i,j} < D_m$
3. Merge each such pair: $t_i = \frac{1}{N_i + N_j} \left( N_i t_i + N_j t_j \right)$

Example parameter settings: $N_D = 10$, $T = 10$, $\sigma_S^2 = 3$, $D_m = 2$, $N_{max} = 3$.
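A compact sketch of K-means plus the IsoData-style merging step above (assumes NumPy; no empty-cluster handling; function names are mine):

```python
import numpy as np

def kmeans(X, K, iters=100):
    """Plain K-means: assign each point to the nearest center, recompute means.
    (Sketch: assumes no cluster ever becomes empty.)"""
    t = X[np.random.choice(len(X), K, replace=False)]   # initial centers
    for _ in range(iters):
        labels = ((X[:, None] - t[None]) ** 2).sum(-1).argmin(1)
        t = np.array([X[labels == k].mean(0) for k in range(K)])
    return t, labels

def merge_close_centers(t, counts, D_m):
    """IsoData-style lumping: merge pairs of centers with D_ij < D_m,
    weighting each merged center by its cluster size."""
    t, counts = list(t), list(counts)
    i = 0
    while i < len(t):
        j = i + 1
        while j < len(t):
            if np.linalg.norm(t[i] - t[j]) < D_m:        # D_ij < D_m
                t[i] = (counts[i] * t[i] + counts[j] * t[j]) / (counts[i] + counts[j])
                counts[i] += counts[j]
                del t[j], counts[j]
            else:
                j += 1
        i += 1
    return np.array(t), counts
```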
Mixture Models

$$p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j)$$

where M is the number of components, $p(x \mid j)$ is the component density, and $P(j)$ is the component weight, with

$$P(j) \ge 0 \;\; \forall j \quad \text{and} \quad \sum_{j=1}^{M} P(j) = 1$$
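For concreteness, a sketch evaluating p(x) for Gaussian components, assuming SciPy's `multivariate_normal` for the component densities (`mixture_pdf` is my name):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, P, mus, Sigmas):
    """p(x) = sum_j p(x|j) P(j) for Gaussian components."""
    return sum(P[j] * multivariate_normal.pdf(x, mus[j], Sigmas[j])
               for j in range(len(P)))

P = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2) * 3]
Sigmas = [np.eye(2), np.eye(2)]
print(mixture_pdf(np.array([0.0, 0.0]), P, mus, Sigmas))
```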
Fit the mixture by maximizing the log-likelihood:

$$l(\theta) \equiv \ln\left( \prod_{n=1}^{N} p(x^n) \right) = \sum_{n=1}^{N} \ln\left( \sum_{j=1}^{M} p(x^n \mid j)\, P(j) \right)$$
Differentiating with respect to the parameters $\theta_j$ of component j:

$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{n=1}^{N} \frac{1}{\sum_{k=1}^{M} p(x^n \mid k)\, P(k)} \frac{\partial}{\partial \theta_j} \left[ p(x^n \mid j)\, P(j) \right]$$
Using Bayes' rule, this can be rearranged as

$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{n=1}^{N} P(j \mid x^n) \cdot \left( \text{Stuff} + \text{More Stuff} \right)$$
For a Gaussian:

$$\frac{\partial l(\theta)}{\partial \mu_j} = \sum_{n=1}^{N} P(j \mid x^n) \left[ \hat\Sigma_j^{-1} \left( x^n - \hat\mu_j \right) \right]$$

$$\frac{\partial l(\theta)}{\partial \hat\Sigma_j} = \sum_{n=1}^{N} P(j \mid x^n) \left[ \hat\Sigma_j^{-1} - \hat\Sigma_j^{-1} \left( x^n - \hat\mu_j \right) \left( x^n - \hat\mu_j \right)^T \hat\Sigma_j^{-1} \right]$$
Setting the derivatives to zero gives:

$$P(j) = \frac{1}{N} \sum_{n=1}^{N} P(j \mid x^n)$$

$$\hat\mu_j = \frac{\sum_{n=1}^{N} P(j \mid x^n)\, x^n}{\sum_{n=1}^{N} P(j \mid x^n)}$$

$$\hat\Sigma_j = \frac{\sum_{n=1}^{N} P(j \mid x^n) \left( x^n - \hat\mu_j \right) \left( x^n - \hat\mu_j \right)^T}{\sum_{n=1}^{N} P(j \mid x^n)}$$
BUT:

$$P(j \mid x^n) = \frac{p(x^n \mid j)\, P(j)}{\sum_{k=1}^{M} p(x^n \mid k)\, P(k)}$$ - the parameters are tied: the posteriors on the right-hand sides themselves depend on the parameters being estimated.

Solution - the EM algorithm.
EM Algorithm
Define the error as the negative log-likelihood:

$$E \equiv -l(\theta) = -\ln\left( \prod_{n=1}^{N} p(x^n) \right) = -\sum_{n=1}^{N} \ln\left\{ p(x^n) \right\}$$

Comparing new and old parameter values (the inner sum below equals $p^{new}(x^n) / p^{old}(x^n)$):

$$E^{new} - E^{old} = -\sum_{n=1}^{N} \ln\left\{ \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)} \right\}$$

$P^{old}(j \mid x^n)$ sums to 1 over j, so each term has the form $\ln\left\{ \sum_{j=1}^{M} \lambda_j y_j \right\}$.
Digression: Convexity
$$f\left( \lambda x_1 + (1 - \lambda) x_2 \right) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$$

[Figure: a convex f(x); the chord value λf(x1) + (1-λ)f(x2) lies above f(λx1 + (1-λ)x2) between x1 and x2.]
If f is a convex function:
$$f\left( \sum_{j=1}^{M} \lambda_j x_j \right) \le \sum_{j=1}^{M} \lambda_j f(x_j), \qquad \forall \lambda : 0 \le \lambda_j \le 1, \;\; \sum_j \lambda_j = 1$$

Equivalently: $f(E[x]) \le E[f(x)]$

Or: $$f\left( \frac{1}{M} \sum_{j=1}^{M} x_j \right) \le \frac{1}{M} \sum_{j=1}^{M} f(x_j)$$
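As a quick numeric sanity check before the proof (illustrative only), Jensen's inequality for the convex function f(x) = e^x:

```python
import numpy as np

f = np.exp                        # a convex function
x = np.array([0.0, 1.0, 2.0])
lam = np.array([0.2, 0.5, 0.3])   # 0 <= lam_j <= 1, sum(lam) == 1

print(f(lam @ x))                 # f(E[x])  ~= 3.00
print(lam @ f(x))                 # E[f(x)] ~= 3.78  >= f(E[x])
```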
Proof by induction:
a) Jensen's inequality is trivially true for any 2 points (definition of convexity)
b) Assuming it is true for any k-1 points, define $\lambda_i^* \equiv \lambda_i / (1 - \lambda_k)$:

$$\sum_{i=1}^{k} \lambda_i f(x_i) = \lambda_k f(x_k) + (1 - \lambda_k) \sum_{i=1}^{k-1} \lambda_i^* f(x_i)$$

$$\ge \lambda_k f(x_k) + (1 - \lambda_k)\, f\left( \sum_{i=1}^{k-1} \lambda_i^* x_i \right) \quad \text{(induction hypothesis)}$$

$$\ge f\left( \lambda_k x_k + (1 - \lambda_k) \sum_{i=1}^{k-1} \lambda_i^* x_i \right) = f\left( \sum_{i=1}^{k} \lambda_i x_i \right) \quad \text{(2-point convexity)}$$
End of digression
Back to EM
$$E^{new} - E^{old} = -\sum_{n=1}^{N} \ln\left\{ \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)} \right\}$$

With $\lambda_j = P^{old}(j \mid x^n)$, by Jensen's inequality ($\ln\left\{ \sum_j \lambda_j y_j \right\} \ge \sum_j \lambda_j \ln\{y_j\}$):

$$E^{new} - E^{old} \le -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \ln\left\{ \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)} \right\} \equiv Q(\theta^{new})$$
Some observations:
• Q is convex
• Q is a function of the new parameters $\theta^{new}$
• So is $E^{new}$
• If $\theta^{new} = \theta^{old}$, then $E^{new} = E^{old} + Q$
[Figure: $E(\theta^{new})$ and the bound $E^{old} + Q(\theta^{new})$ plotted against $\theta^{new}$; the two curves touch at $\theta^{old}$. A step downhill in Q leads downhill in $E^{new}$!]
EM (cont.)
$$Q = -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \ln\left\{ \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)} \right\}$$

The denominator terms $p^{old}(x^n)\, P^{old}(j \mid x^n)$ do not depend on the new parameters, so we can drop them. For Gaussian components $G_j$:

$$Q = -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \left\{ \ln P^{new}(j) + \ln\left( G_j(x^n) \right) \right\} + \text{const}$$
Minimizing Q gives the updates:

$$\hat\mu_j = \frac{\sum_{n=1}^{N} P^{old}(j \mid x^n)\, x^n}{\sum_{n=1}^{N} P^{old}(j \mid x^n)}$$ - a convex sum, weighted w.r.t. the previous estimate

$$\hat\Sigma_j = \frac{\sum_{n=1}^{N} P^{old}(j \mid x^n) \left( x^n - \hat\mu_j \right) \left( x^n - \hat\mu_j \right)^T}{\sum_{n=1}^{N} P^{old}(j \mid x^n)}$$ - a convex sum, weighted w.r.t. the previous estimate
For the weights, enforce normalization with a Lagrange multiplier:

$$J_P = Q + \lambda \left( \sum_{j=1}^{M} P^{new}(j) - 1 \right)$$

$$\frac{\partial J_P}{\partial P^{new}(j)} = -\sum_{n=1}^{N} \frac{P^{old}(j \mid x^n)}{P^{new}(j)} + \lambda = 0$$

$$\Rightarrow\; \lambda\, P^{new}(j) = \sum_{n=1}^{N} P^{old}(j \mid x^n)$$

$$\Rightarrow\; \lambda \sum_{j=1}^{M} P^{new}(j) = \sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n)$$

$$\Rightarrow\; \lambda = N, \qquad P^{new}(j) = \frac{1}{N} \sum_{n=1}^{N} P^{old}(j \mid x^n)$$
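Putting the updates together, a minimal sketch of one EM iteration for a Gaussian mixture (my own illustration of the equations above, assuming SciPy; `em_step` is not the course's code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, P, mus, Sigmas):
    """One EM iteration for a Gaussian mixture.
    E-step: posteriors P_old(j|x^n); M-step: re-estimate P, mu, Sigma."""
    N, M = len(X), len(P)
    # E-step: responsibilities r[n, j] = P_old(j | x^n)
    r = np.array([P[j] * multivariate_normal.pdf(X, mus[j], Sigmas[j])
                  for j in range(M)]).T
    r /= r.sum(axis=1, keepdims=True)
    # M-step: convex sums weighted by the old posteriors
    Nj = r.sum(axis=0)                 # sum_n P_old(j | x^n)
    P_new = Nj / N                     # P_new(j) = (1/N) sum_n P_old(j|x^n)
    mus_new = (r.T @ X) / Nj[:, None]  # weighted means
    Sigmas_new = []
    for j in range(M):
        D = X - mus_new[j]
        Sigmas_new.append((r[:, j, None] * D).T @ D / Nj[j])  # weighted covariances
    return P_new, mus_new, Sigmas_new
```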
EM Example
[Figure: mixture with $N_c = 3$ components, means m1, m2, m3; the posteriors $P(j = 2 \mid x)$ and $P(j = 3 \mid x)$ are shown for $\gamma = 0$.]

Sharpen the posterior with a parameter $\gamma$:

$$\tilde P(j \mid x) = \frac{P(j \mid x)\, e^{\gamma P(j \mid x)}}{\sum_k P(k \mid x)\, e^{\gamma P(k \mid x)}}$$

If $\gamma = 0$, $\tilde P(j \mid x) = P(j \mid x)$, i.e. standard EM.

Now let's relax $\gamma$:

$$\lim_{\gamma \to \infty} \tilde P(j \mid x) = \delta\left( P(j \mid x),\, \max_k P(k \mid x) \right)$$

i.e. each point is hard-assigned to its most probable component. This is K-Means!!!
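A tiny sketch of this tempered posterior (my illustration): γ = 0 leaves the EM posterior unchanged, while large γ approaches the hard K-means assignment.

```python
import numpy as np

def sharpen(P_jx, gamma):
    """P~(j|x) = P(j|x) e^{gamma P(j|x)} / sum_k P(k|x) e^{gamma P(k|x)}."""
    w = P_jx * np.exp(gamma * P_jx)
    return w / w.sum()

P_jx = np.array([0.5, 0.3, 0.2])   # soft posterior over 3 components
print(sharpen(P_jx, 0.0))          # gamma = 0: unchanged (EM)
print(sharpen(P_jx, 100.0))        # gamma large: ~[1, 0, 0] (K-means)
```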
Hierarchical Clustering
[Figure: dendrogram over x1 … x6, cut at threshold levels 1, 2, 3.]

There are 2 ways to do it:
• Agglomerative (bottom-up)
• Divisive (top-down)

General structure (agglomerative; see the sketch below):

Initialize: $K$; $\hat K \leftarrow N$; $D_n \leftarrow \{x_n\},\; n = 1..N$
do $\hat K \leftarrow \hat K - 1$
  $i, j = \operatorname{argmin}_{l,m} d(D_l, D_m)$
  merge($D_i$, $D_j$)
until $\hat K == K$

We need to specify the cluster-to-cluster distance d; each choice induces a different algorithm. Examples:

$$d = d_{mean}(D_i, D_j) = \left\| m_i - m_j \right\|$$

$$d = d_{min}(D_i, D_j) = \min_{x_1 \in D_i,\, x_2 \in D_j} \left\| x_1 - x_2 \right\|$$

$$d = d_{max}(D_i, D_j) = \max_{x_1 \in D_i,\, x_2 \in D_j} \left\| x_1 - x_2 \right\|$$
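A direct transcription of this general structure into Python (a sketch; the three linkage functions implement the examples above):

```python
import numpy as np
from itertools import combinations

def d_min(Di, Dj):   # single linkage
    return min(np.linalg.norm(x1 - x2) for x1 in Di for x2 in Dj)

def d_max(Di, Dj):   # complete linkage
    return max(np.linalg.norm(x1 - x2) for x1 in Di for x2 in Dj)

def d_mean(Di, Dj):  # distance between cluster means
    return np.linalg.norm(np.mean(Di, axis=0) - np.mean(Dj, axis=0))

def agglomerate(X, K, d=d_min):
    """One cluster per point; merge the closest pair until K clusters remain."""
    D = [[x] for x in X]              # D_n <- {x_n}
    while len(D) > K:                 # K_hat <- K_hat - 1 each pass
        i, j = min(combinations(range(len(D)), 2),
                   key=lambda ij: d(D[ij[0]], D[ij[1]]))
        D[i].extend(D.pop(j))         # merge(D_i, D_j)
    return D
```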
Single Linkage Algorithm
[Figure: single-linkage ($d_{min}$) result with N = 2 clusters.]

Each cluster is a minimal spanning tree of the data in the cluster. Single linkage identifies clusters that are well separated.
Complete Linkage Algorithm
[Figure: complete-linkage ($d_{max}$) result with N = 2 clusters.]

Each cluster is a complete subgraph of the data. Complete linkage identifies clusters that are well localized.
Summary