
Introduction to (Statistical) Machine Learning
Brown University CSCI1420 & ENGN2520
Prof. Erik Sudderth
Lecture for Nov. 7, 2013:
K-Means Algorithm, Probabilistic Mixture Models, Directed Graphical Models

Many figures courtesy Kevin Murphy’s textbook,
Machine Learning: A Probabilistic Perspective
Clustering Problems
•  Only input: Observed vectors $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
   (can also cluster more complex data types)
•  Desired output: $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$
   Assign each observation to a single, unique cluster.
•  Common applications:
   •  Compression (store clusters rather than raw data)
   •  Prediction (new data expected to look like some cluster)
   •  Visualization and understanding (by a human)
•  Best clustering (and K) depend on application goals
[Figure: height versus weight scatter plots of the same data, clustered with K = 2 and K = 3.]
Clustering Evaluation: Rand Index
Unlike classification problems, comparing automatic
clusterings to held-out “true” clusters is tricky:
•  The number of assumed clusters may be different
•  There is no correspondence between the true cluster labels and the
   arbitrary labels my algorithm uses to encode a clustering
The Rand index computes the following counts for all pairs of data points:
•  False positive (FP): Target splits, but algorithm clusters together
•  False negative (FN): Target clusters together, but algorithm splits
•  True positive (TP): Algorithm and target both cluster together
•  True negative (TN): Algorithm and target both split apart
Invariant to label choices, and takes time linear in N.

[Figure: example comparison of a proposed clustering to true labels. Circles are proposed clusters, letters are true cluster labels.]
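A minimal pairwise sketch of the Rand index (the function name is illustrative). This naive version is O(N²) over pairs; the linear-time claim above relies on computing the same four counts from a contingency table instead.

```python
import numpy as np

def rand_index(true_labels, pred_labels):
    """Fraction of point pairs on which the two clusterings agree:
    pairs both clustered together (TP) or both split apart (TN), over all pairs."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = len(true_labels)
    agreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = true_labels[i] == true_labels[j]
            same_pred = pred_labels[i] == pred_labels[j]
            if same_true == same_pred:      # TP or TN pair
                agreements += 1
    return agreements / (n * (n - 1) / 2)

# Labels are arbitrary codes; only co-membership matters.
print(rand_index([0, 0, 1, 1], ["a", "a", "b", "c"]))   # 5 of 6 pairs agree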
K-Means Objective: Compression
•  Observed feature vectors: $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
•  Hidden cluster centers: $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$
•  Hidden cluster assignments: $z_i$, $i = 1, 2, \ldots, N$

Integer encoding of assignments: $z_i \in \{1, 2, \ldots, K\}$
$$J(z, \mu \mid x, K) = \sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2$$
Indicator encoding of assignments: $z_{ik} = 1$ if assigned to cluster $k$, $z_{ik} = 0$ otherwise
$$J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \|x_i - \mu_k\|^2$$
Problem: Minimization of the K-means objective is NP-hard.
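As a concrete reading of the integer-encoded objective, a minimal NumPy sketch (names are illustrative; clusters are indexed from 0):

```python
import numpy as np

def kmeans_objective(X, z, mu):
    """K-means cost J(z, mu | x, K) under the integer encoding:
    total squared distance from each point x_i to its assigned center mu_{z_i}.
    X: (N, d) data, z: (N,) assignments in {0, ..., K-1}, mu: (K, d) centers."""
    return float(np.sum((X - mu[z]) ** 2))
```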
K-Means Algorithm
$$J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \|x_i - \mu_k\|^2, \qquad z_{ik} = 1 \text{ if assigned to cluster } k, \; z_{ik} = 0 \text{ otherwise}$$

Initialization: Choose random cluster centers $\mu^{(0)}$

Assignment Step: $z^{(t)} = \arg\min_z J(z, \mu^{(t-1)} \mid x, K)$
$$z_{ik}^{(t)} = 1 \quad \text{if } \|x_i - \mu_k^{(t-1)}\|^2 < \|x_i - \mu_\ell^{(t-1)}\|^2 \text{ for all } \ell \neq k$$
Assign data to the closest cluster centers, breaking ties arbitrarily.

Mean Update Step: $\mu^{(t)} = \arg\min_\mu J(z^{(t)}, \mu \mid x, K)$
$$\mu_k^{(t)} = \frac{1}{N_k^{(t)}} \sum_{i=1}^{N} z_{ik}^{(t)} x_i, \qquad N_k^{(t)} = \sum_{i=1}^{N} z_{ik}^{(t)}$$
Means of the data assigned to each cluster center (least squares).
Undefined in sub-optimal configurations where some clusters are unused.
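A compact NumPy sketch of these two alternating steps (the function name and the handling of empty clusters are illustrative choices, not the lecture's prescription):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Alternate the assignment and mean-update steps until assignments stop changing.
    Initialization here simply picks K distinct random data points as centers."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    z = np.full(len(X), -1)
    for _ in range(max_iters):
        # Assignment step: send each point to its closest center (ties broken by argmin).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):
            break                            # converged: assignments unchanged
        z = z_new
        # Mean update step: each center becomes the mean of its assigned points.
        for k in range(K):
            if np.any(z == k):               # unused clusters are left where they are
                mu[k] = X[z == k].mean(axis=0)
    return z, mu
```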
K-Means Algorithm
[Figure: panels (a)–(i) showing the cluster centers and assignments after successive assignment and mean-update steps, through convergence. C. Bishop, Pattern Recognition & Machine Learning]
Reconstruction Error versus Iteration
[Figure: the objective $J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \|x_i - \mu_k\|^2$ plotted after each assignment step and each mean-update step; the cost decreases monotonically across iterations. C. Bishop, Pattern Recognition & Machine Learning]
K-Means Implementation & Properties
$$J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \|x_i - \mu_k\|^2, \qquad z_{ik} = 1 \text{ if assigned to cluster } k, \; z_{ik} = 0 \text{ otherwise}$$
$$z^{(t)} = \arg\min_z J(z, \mu^{(t-1)} \mid x, K) \qquad \mu^{(t)} = \arg\min_\mu J(z^{(t)}, \mu \mid x, K)$$
Initialization: Choose random cluster centers $\mu^{(0)}$
•  Should be distinct (breaking symmetry) and in the “region” of the data
•  Common heuristic: Randomly pick K data points
•  K-Means++: Randomly pick K widely separated data points (see the sketch below)
Theoretical Guarantees:
•  Converges after finitely many iterations ($z^{(t+1)} = z^{(t)}$)
•  Worst-case convergence time is poor (super-polynomial in N)
•  Different initializations may produce very different solutions
•  Converged objective may be arbitrarily worse than the optimum,
   but smart initializations (K-Means++) do allow some guarantees
•  In practice, can usually still find “useful” local optima
•  Optimal reconstruction error always decreases with K, reaching 0 if K = N
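The K-Means++ initialization mentioned above can be sketched as follows (a simplified version of the standard squared-distance seeding rule; the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-Means++ style seeding: the first center is a uniformly random data point;
    each further center is a data point drawn with probability proportional to its
    squared distance from the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```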
Test Error versus K
[Figure: MSE on held-out test data versus K for K-means, for data drawn from a mixture of 3 one-dimensional Gaussian distributions; the test MSE keeps decreasing as K grows.]

For compressing new data, more codewords are always better.
Cross-validation fails for unsupervised learning!
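A small experiment illustrating why held-out reconstruction error cannot select K (a sketch assuming scikit-learn, which the lecture does not specify, and synthetic 1D data resembling the plot's setup):

```python
import numpy as np
from sklearn.cluster import KMeans   # assumption: scikit-learn, not part of the lecture

# Held-out reconstruction MSE keeps shrinking as K grows.
rng = np.random.default_rng(0)
sample = lambda n: np.concatenate([rng.normal(m, 0.5, size=(n, 1)) for m in (-3.0, 0.0, 3.0)])
X_train, X_test = sample(200), sample(200)

for K in (2, 3, 5, 10, 20):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_train)
    nearest = km.cluster_centers_[km.predict(X_test)]
    print(K, np.mean(np.sum((X_test - nearest) ** 2, axis=1)))
```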
Gaussian Mixture Models
•  Observed feature vectors: $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
•  Hidden cluster labels: $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$
•  Hidden mixture means: $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$
•  Hidden mixture covariances: $\Sigma_k \in \mathbb{R}^{d \times d}$, $k = 1, 2, \ldots, K$
•  Hidden mixture probabilities: $\pi_k$, $\sum_{k=1}^{K} \pi_k = 1$
•  Gaussian mixture generative model:
$$p(z_i) = \mathrm{Cat}(z_i \mid \pi) \qquad p(x_i \mid z_i) = \mathrm{Norm}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
$$p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} \, \mathrm{Norm}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
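A minimal NumPy/SciPy sketch of this generative model and its marginal density (function names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_sample(pi, mus, Sigmas, N, seed=0):
    """Generative model: z_i ~ Cat(pi), then x_i ~ Norm(mu_{z_i}, Sigma_{z_i})."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=N, p=pi)
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return z, x

def gmm_density(x, pi, mus, Sigmas):
    """Marginal density p(x | pi, mu, Sigma): sum over cluster assignments."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
               for k in range(len(pi)))
```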
Gaussian Mixture Models
[Figure: a mixture of 3 Gaussian distributions in 2D (left), and a contour plot of the joint density, marginalizing cluster assignments (right).]
Gaussian Mixture Models
[Figure: surface plot of the joint density, marginalizing cluster assignments.]
Gaussian Discriminant Analysis
$y$: class label in {1, …, C}, observed in training
$x \in \mathbb{R}^d$: observed features to be used for classification
$$p(y, x \mid \pi, \theta) = p(y \mid \pi) \, p(x \mid y, \theta)$$
Discriminant analysis is a generative classifier: a prior distribution over labels times a likelihood function for the features.
$$p(y \mid \pi) = \mathrm{Cat}(y \mid \pi) \qquad p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \Sigma_c), \quad \theta_c = \{\mu_c, \Sigma_c\}$$
•  The joint probability distributions for Gaussian discriminant analysis and the Gaussian mixture model are identical!
•  The difference is in interpretation and usage:
   •  For discriminant analysis, class labels y are training data
   •  For mixture models, cluster labels z are never observed
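To make the generative-classifier reading concrete, a short sketch of classification with known parameters (helper name is illustrative; learning the parameters is the ML/MAP step discussed later):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_predict(x, pi, mus, Sigmas):
    """Class posterior of the generative classifier via Bayes' rule:
    p(y = c | x) is proportional to pi_c * Norm(x | mu_c, Sigma_c)."""
    scores = np.array([pi[c] * multivariate_normal.pdf(x, mean=mus[c], cov=Sigmas[c])
                       for c in range(len(pi))])
    return scores / scores.sum()
```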
Directed Acyclic Graphs (DAGs)
[Figure: example directed acyclic graph over nodes $X_1, \ldots, X_6$.]

$V$: set of N nodes or vertices, $\{1, 2, \ldots, N\}$
$E$: set of oriented edges $(s, t)$ linking parents $s$ to children $t$, so that the set of parents of a node is
$$\mathrm{pa}(t) = \{s \in V \mid (s, t) \in E\}$$
$X_s = x_s$: random variable associated with node $s$
Directed Graphical Models
Chain rule implies that any joint distribution equals:
$$p(x_1, \ldots, x_N) = \prod_{t=1}^{N} p(x_t \mid x_1, \ldots, x_{t-1})$$
Directed graphical model implies a restricted factorization:
$$p(x_1, \ldots, x_N) = \prod_{t=1}^{N} p(x_t \mid x_{\mathrm{pa}(t)})$$
nodes → random variables
pa(t) → parents with edges pointing to node t
Valid for any directed acyclic graph (DAG):
equivalent to dropping conditional dependencies in the standard chain rule.
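A toy numerical example of the restricted factorization (the 3-node DAG and its tables below are hypothetical, not the graph from the slides):

```python
import numpy as np

# p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1), all variables binary.
p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.7, 0.3],    # rows index x1, columns index x2
                          [0.2, 0.8]])
p_x3_given_x1 = np.array([[0.9, 0.1],
                          [0.5, 0.5]])

def joint(x1, x2, x3):
    """One conditional factor per node, each conditioned only on that node's parents."""
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x1[x1, x3]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12          # the factorization defines a valid distribution
```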
Name That Model
Naïve Bayes: $p(y, x_{1:D}) = p(y) \prod_{j=1}^{D} p(x_j \mid y)$

Shading & Plate Notation
[Figure: plate diagram for Naïve Bayes, with one class node and feature nodes $X_j$ inside a plate of size D. Plates denote replication of random variables.]

Naïve Bayes Inference: $p(y \mid x_{1:D}) \propto p(y) \prod_{j=1}^{D} p(x_j \mid y)$

Convention: Shaded nodes are observed, open nodes are latent/hidden
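A small sketch of this inference rule for binary features (a hypothetical Bernoulli instantiation; the lecture does not fix the form of p(x_j | y)):

```python
import numpy as np

def naive_bayes_posterior(x, prior, theta):
    """p(y = c | x) proportional to p(y = c) * prod_j p(x_j | y = c), for binary
    features, with theta[c, j] = p(x_j = 1 | y = c)."""
    x = np.asarray(x)
    likelihood = np.prod(theta ** x * (1.0 - theta) ** (1 - x), axis=1)
    scores = prior * likelihood
    return scores / scores.sum()
```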


Parameterization & Representation
[Figure: the example DAG over $X_1, \ldots, X_6$, annotated with one conditional probability table per node, each conditioned on that node’s parents.]

Representational (storage, learning, computation) Complexity
•  Joint distribution: Exponential in number of variables
•  Directed graphical model: Exponential in number of parents (“fan-in”) of each node, linear in number of nodes
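To make the complexity comparison concrete, a quick parameter count for binary variables (the DAG structure below is hypothetical, not necessarily the one in the figure):

```python
# Full joint table over N binary variables: 2^N - 1 free parameters.
# DAG factorization: one Bernoulli parameter per setting of each node's parents,
# i.e. the sum over nodes of 2^{|pa(t)|}.
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}   # hypothetical DAG
N = len(parents)

full_joint_params = 2 ** N - 1
dag_params = sum(2 ** len(pa) for pa in parents.values())
print(full_joint_params, dag_params)     # 63 versus 13 for this example
```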
Gaussian Mixture Models
•  Observed feature vectors: $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
•  Hidden cluster labels: $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$
•  Hidden mixture means: $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$
•  Hidden mixture covariances: $\Sigma_k \in \mathbb{R}^{d \times d}$, $k = 1, 2, \ldots, K$
•  Hidden mixture probabilities: $\pi_k$, $\sum_{k=1}^{K} \pi_k = 1$
•  Gaussian mixture generative model:
$$p(z_i) = \mathrm{Cat}(z_i \mid \pi) \qquad p(x_i \mid z_i) = \mathrm{Norm}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
$$p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} \, \mathrm{Norm}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
Supervised Learning
Generative ML or MAP Learning:
$$\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \left[ \log p(y_i \mid \pi) + \log p(x_i \mid y_i, \theta) \right]$$
[Figure: plate diagrams for the generative and discriminative models, with training instances $(y_i, x_i)$, $i = 1, \ldots, N$, held-out test instances $(y_t, x_t)$, and parameters $\pi, \theta$ shared between train and test.]
Discriminative ML or MAP Learning:
$$\max_{\theta} \; \log p(\theta) + \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$
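For the Gaussian generative classifier, the learning maximization separates across classes and reduces to counting and averaging; a minimal sketch (no priors, hence ML rather than MAP; the function name is illustrative):

```python
import numpy as np

def fit_generative_gaussian(X, y, C):
    """ML estimate of the generative classifier: class frequencies for pi,
    per-class sample means and covariances for theta_c = {mu_c, Sigma_c}."""
    pi = np.array([np.mean(y == c) for c in range(C)])
    mus = np.array([X[y == c].mean(axis=0) for c in range(C)])
    Sigmas = np.array([np.cov(X[y == c].T, bias=True) for c in range(C)])
    return pi, mus, Sigmas
```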
Unsupervised Learning
Clustering:
$$\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \log \Bigl[ \sum_{z_i} p(z_i \mid \pi) \, p(x_i \mid z_i, \theta) \Bigr]$$
Dimensionality Reduction:
$$\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \log \int p(z_i \mid \pi) \, p(x_i \mid z_i, \theta) \, dz_i$$
[Figure: plate diagram with hidden $z_i$, observed $x_i$, and parameters $\pi, \theta$.]
•  No notion of training and test data: labels are never observed
•  As before, maximize posterior probability of model parameters
•  For hidden variables associated with each observation, we marginalize over possible values rather than estimating them
•  Fully accounts for uncertainty in these variables
•  There is one hidden variable per observation, so we cannot estimate them perfectly even with infinite data
•  Must use a generative model (the discriminative objective degenerates)
Unsupervised Learning Algorithms
[Figure: plate diagrams for supervised training, supervised testing, and unsupervised learning, with parameters $\pi, \theta$ shared across instances and hidden variables $z_i$ unique to particular instances.]
$\pi, \theta$: parameters (shared across instances)
$z_1, \ldots, z_N$: hidden data (unique to particular instances)
•  Initialization: Randomly select starting parameters
•  Estimation: Given parameters, infer likely hidden data
   •  Similar to testing phase of supervised learning
•  Learning: Given hidden & observed data, find likely parameters
   •  Similar to training phase of supervised learning
•  Iteration: Alternate estimation & learning until convergence (see the sketch below)