Machine Learning
Brown University CSCI1420 & ENGN2520
Prof. Erik Sudderth
Lecture for Nov. 7, 2013:
K-Means Algorithm, Probabilistic Mixture Models,
Directed Graphical Models
[Figure: three scatter plots of weight (roughly 80 to 260) versus height (roughly 55 to 80), showing the same data grouped in different ways.]
Clustering Evaluation: Rand Index
Unlike classification problems, comparison of automatic
clusterings to held-out “true” clusters is tricky:
• The number of assumed clusters may be different
• No correspondence between the true cluster labels and the arbitrary labels my algorithm uses to encode a clustering
The Rand index computes the following counts over all pairs of data points:
• False positive (FP): Target splits but algorithm clusters
• False negative (FN): Target clusters but algorithm splits
• True positive (TP): Algorithm and target both cluster together
• True negative (TN): Algorithm and target both split apart
Invariant to label choices, and computable in time linear in N (via contingency counts rather than explicit pairs); a minimal sketch follows below.
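To make the pair-based counts concrete, here is a minimal Python sketch of the Rand index, written as the deliberately naive quadratic-time pairwise version rather than the linear-time contingency-count computation; the function name and integer-label inputs are illustrative assumptions, not part of the lecture.

import numpy as np

def rand_index(true_labels, pred_labels):
    """Rand index via explicit pair counting (O(N^2) pairs).

    A pair counts as agreement when both clusterings place it together (TP)
    or both split it apart (TN); FP and FN are the two disagreement cases.
    """
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = len(true_labels)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = true_labels[i] == true_labels[j]
            same_pred = pred_labels[i] == pred_labels[j]
            if same_true == same_pred:   # TP or TN
                agree += 1
    return agree / (n * (n - 1) / 2)

# Example: labels that differ only by an arbitrary renaming give Rand index 1.0
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))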
K-Means Algorithm
[Figure: a sequence of panels, (b) through (i), showing successive assignment and mean-update steps of the K-means algorithm on a two-dimensional dataset (axes roughly -2 to 2). From C. Bishop, Pattern Recognition & Machine Learning.]
Reconstruction Error versus Iteration
J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \, \|x_i - \mu_k\|^2
[Figure: the objective J plotted against iteration number, decreasing from roughly 1000 toward 0 as the alternating "Assignments" and "Means" updates are applied. From C. Bishop, Pattern Recognition & Machine Learning.]
K-Means Implementation & Properties
J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \, \|x_i - \mu_k\|^2, where z_{ik} = 1 if x_i is assigned to cluster k and z_{ik} = 0 otherwise.
z^{(t)} = \arg\min_{z} J(z, \mu^{(t-1)} \mid x, K), \qquad \mu^{(t)} = \arg\min_{\mu} J(z^{(t)}, \mu \mid x, K)
(These alternating updates are implemented in the short sketch after this slide's bullets.)
Initialization: Choose random cluster centers \mu^{(0)}
• Should be distinct (breaking symmetry) and in “region” of data
• Common heuristic: Randomly pick K data points
• K-Means++: Randomly pick K widely separated data points
Theoretical Guarantees:
• Converges after finitely many iterations (z^{(t+1)} = z^{(t)})
• Worst-case convergence time poor (super-polynomial in N)
• Different initializations may produce very different solutions
• Converged objective may be arbitrarily worse than optimum,
but smart initializations (K-Means++) do allow some guarantees
• In practice, can usually still find “useful” local optima
• Optimal reconstruction error always decreases with K, 0 if K=N
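A minimal NumPy sketch of the alternating updates and the random-data-point initialization heuristic described above; the function name, fixed iteration cap, and convergence test are illustrative assumptions rather than code from the lecture.

import numpy as np

def kmeans(x, K, n_iters=100, seed=0):
    """Minimal K-means: x is (N, D); returns (assignments z, cluster means mu)."""
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    # Common heuristic initialization: K distinct random data points.
    mu = x[rng.choice(N, size=K, replace=False)]
    z = np.full(N, -1)
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest mean.
        dists = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):       # converged: assignments unchanged
            break
        z = z_new
        # Mean step: each mean becomes the average of its assigned points.
        for k in range(K):
            if np.any(z == k):
                mu[k] = x[z == k].mean(axis=0)
    return z, mu

# Usage: reconstruction error J of the converged solution on synthetic data.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
z, mu = kmeans(x, K=3)
J = ((x - mu[z]) ** 2).sum()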
Test Error versus K
[Figure: "MSE on test vs K for K-means": test-set mean squared error plotted against K from 2 to 16, for data from a mixture of 3 one-dimensional Gaussian distributions.]
p(y \mid \pi) = \mathrm{Cat}(y \mid \pi)
p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \Sigma_c), \quad \theta_c = \{\mu_c, \Sigma_c\}
• Joint probability distributions for Gaussian discriminant analysis and the Gaussian mixture model are identical! (A small sampling sketch of this shared generative model follows this list.)
• Difference is in interpretation and usage:
• For discriminant analysis, class labels y are training data
• For mixture model, cluster labels z are never observed
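To make the generative model above concrete, here is a small NumPy sketch that draws samples by first sampling a label from Cat(π) and then a point from the corresponding Gaussian; the mixture weights, means, and covariances are made-up illustrative values, not parameters from the lecture.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2D (assumed values).
pi = np.array([0.4, 0.6])                      # mixture weights for Cat(y | pi)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def sample_gmm(n):
    """Ancestral sampling: z ~ Cat(pi), then x | z = c ~ N(mu_c, Sigma_c)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.stack([rng.multivariate_normal(mus[c], Sigmas[c]) for c in z])
    return z, x

z, x = sample_gmm(500)
# In a mixture model the labels z are discarded (never observed);
# in Gaussian discriminant analysis the same labels would be given as training data y.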
Directed Acyclic Graphs (DAGs)
[Figure: example directed acyclic graph over nodes X1, ..., X6.]
Shading & Plate Notation
Shaded nodes denote observed random variables; plates denote replication of random variables.
[Figure: the naïve Bayes model drawn with a plate over the features Xj, j = 1, ..., D.]
[Figure: directed graphical models for supervised learning, with parameters π and θ shared across the training pairs (xi, yi), i = 1, ..., N, and the test pair (xt, yt).]
Discriminative ML or MAP Learning:
\max_{\theta} \; \log p(\theta) + \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)
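As one concrete instance of this objective, the sketch below evaluates the MAP objective for logistic regression with a Gaussian prior on the weights; the choice of model, the prior variance, and the synthetic data are assumptions for illustration only.

import numpy as np

def map_objective(theta, X, y, prior_var=1.0):
    """log p(theta) + sum_i log p(y_i | x_i, theta) for logistic regression.

    Gaussian prior: log p(theta) = -||theta||^2 / (2 * prior_var) + const.
    Likelihood:     p(y = 1 | x, theta) = sigmoid(x . theta), with y in {0, 1}.
    """
    logits = X @ theta
    # Numerically stable log-likelihood: sum_i [y_i * logit_i - log(1 + exp(logit_i))]
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    log_prior = -0.5 * np.dot(theta, theta) / prior_var
    return log_prior + log_lik

# Tiny synthetic example (assumed data, not from the lecture).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
print(map_objective(np.zeros(3), X, y))  # objective at theta = 0

A MAP estimate would maximize this quantity over theta, for example by gradient ascent; dropping the log_prior term recovers plain maximum likelihood.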
Unsupervised Learning
Clustering:
\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \log \left[ \sum_{z_i} p(z_i \mid \pi) \, p(x_i \mid z_i, \theta) \right]
Dimensionality Reduction:
\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \log \int_{z_i} p(z_i \mid \pi) \, p(x_i \mid z_i, \theta) \, dz_i
[Figure: directed graphical model with parameters π and θ, hidden variables zi, and observations xi, replicated in a plate over i = 1, ..., N.]
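For the clustering case, the marginalized data term can be evaluated in closed form for a Gaussian mixture. The sketch below computes the sum over i of log sum over z_i of p(z_i | π) p(x_i | z_i, θ) for a one-dimensional mixture; the priors on π and θ are omitted, and all parameter values and data are illustrative assumptions.

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def mixture_log_likelihood(x, pi, mus, sigmas):
    """sum_i log sum_k pi_k * N(x_i | mu_k, sigma_k^2) for a 1D Gaussian mixture."""
    # (N, K) matrix of log [pi_k * N(x_i | mu_k, sigma_k^2)]
    log_joint = np.log(pi)[None, :] + norm.logpdf(
        x[:, None], loc=mus[None, :], scale=sigmas[None, :])
    return logsumexp(log_joint, axis=1).sum()   # marginalize z_i, then sum over i

# Illustrative data and parameters (assumed, not from the lecture).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
print(mixture_log_likelihood(x, np.array([0.5, 0.5]),
                             np.array([-2.0, 2.0]), np.array([1.0, 1.0])))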
[Figure: directed graphical models compared for supervised training, supervised testing, and unsupervised learning, each with parameters θ outside a plate over the N instances xi (or xt).]
π, θ: parameters (shared across instances)
z1, ..., zN: hidden data (unique to particular instances)
• Initialization: Randomly select starting parameters
• Estimation: Given parameters, infer likely hidden data
• Similar to testing phase of supervised learning
• Learning: Given hidden & observed data, find likely parameters
• Similar to training phase of supervised learning
• Iteration: Alternate estimation & learning until convergence (a minimal sketch of this loop for a Gaussian mixture follows below)
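A minimal sketch of this estimation/learning alternation for a one-dimensional Gaussian mixture, using the standard EM-style soft-assignment updates; the synthetic data, number of components, and fixed iteration count are assumptions, not code from the lecture.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 150)])  # synthetic data
K = 2

# Initialization: randomly select starting parameters.
pi = np.full(K, 1.0 / K)
mu = rng.choice(x, size=K, replace=False)
sigma = np.full(K, x.std())

for _ in range(50):
    # Estimation: given parameters, infer likely hidden data
    # (soft responsibilities r[i, k] for each point and cluster).
    r = pi[None, :] * norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    r /= r.sum(axis=1, keepdims=True)
    # Learning: given hidden & observed data, find likely parameters.
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk)
    # Iteration: a fixed number of alternations here; in practice, stop
    # when the assignments or the log-likelihood stop changing.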