
Introduction to (Statistical) Machine Learning
Brown University CSCI1420 & ENGN2520
Prof. Erik Sudderth
Lecture for Nov. 7, 2013:
K-Means Algorithm, Probabilistic Mixture Models, Directed Graphical Models

Many figures courtesy Kevin Murphy’s textbook,
Machine Learning: A Probabilistic Perspective
Clustering Problems
•  Only input: Observed vectors $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
   (can also cluster more complex data types)
•  Desired output: $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$
   Assign each observation to a single, unique cluster.
•  Common applications:
   •  Compression (store clusters rather than raw data)
   •  Prediction (new data expected to look like some cluster)
   •  Visualization and understanding (by a human)
•  Best clustering (and K) depend on application goals
[Figure: height versus weight scatter plots of the same data, clustered with K = 2 and K = 3.]
Clustering Evaluation: Rand Index
Unlike classification problems, comparing automatic
clusterings to held-out “true” clusters is tricky:
•  The number of assumed clusters may be different
•  There is no correspondence between the true cluster labels and the
   arbitrary labels my algorithm uses to encode a clustering
The Rand index computes the following counts for all pairs of data points:
•  False positive (FP): Target splits, but algorithm clusters together
•  False negative (FN): Target clusters together, but algorithm splits
•  True positive (TP): Algorithm and target both cluster together
•  True negative (TN): Algorithm and target both split apart
Invariant to label choices, and takes time linear in N.

[Figure: example comparison of a proposed clustering to true labels. Circles are proposed clusters, letters are true cluster labels.]
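A minimal pairwise sketch of the Rand index (the function name is illustrative). This naive version is O(N²) over pairs; the linear-time claim above relies on computing the same four counts from a contingency table instead.

```python
import numpy as np

def rand_index(true_labels, pred_labels):
    """Fraction of point pairs on which the two clusterings agree:
    pairs both clustered together (TP) or both split apart (TN), over all pairs."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    n = len(true_labels)
    agreements = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = true_labels[i] == true_labels[j]
            same_pred = pred_labels[i] == pred_labels[j]
            if same_true == same_pred:      # TP or TN pair
                agreements += 1
    return agreements / (n * (n - 1) / 2)

# Labels are arbitrary codes; only co-membership matters.
print(rand_index([0, 0, 1, 1], ["a", "a", "b", "c"]))   # 5 of 6 pairs agree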
K-Means Objective: Compression
•  Observed feature vectors: $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
•  Hidden cluster centers: $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$
•  Hidden cluster assignments: $z_i$, $i = 1, 2, \ldots, N$

Integer encoding of assignments: $z_i \in \{1, 2, \ldots, K\}$
$$J(z, \mu \mid x, K) = \sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2$$
Indicator encoding of assignments: $z_{ik} = 1$ if assigned to cluster $k$, $z_{ik} = 0$ otherwise
$$J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \|x_i - \mu_k\|^2$$
Problem: Minimization of the K-means objective is NP-hard.
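As a concrete reading of the integer-encoded objective, a minimal NumPy sketch (names are illustrative; clusters are indexed from 0):

```python
import numpy as np

def kmeans_objective(X, z, mu):
    """K-means cost J(z, mu | x, K) under the integer encoding:
    total squared distance from each point x_i to its assigned center mu_{z_i}.
    X: (N, d) data, z: (N,) assignments in {0, ..., K-1}, mu: (K, d) centers."""
    return float(np.sum((X - mu[z]) ** 2))
```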
K-Means Algorithm
$$J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \|x_i - \mu_k\|^2, \qquad z_{ik} = 1 \text{ if assigned to cluster } k, \; z_{ik} = 0 \text{ otherwise}$$

Initialization: Choose random cluster centers $\mu^{(0)}$

Assignment Step: $z^{(t)} = \arg\min_z J(z, \mu^{(t-1)} \mid x, K)$
$$z_{ik}^{(t)} = 1 \quad \text{if } \|x_i - \mu_k^{(t-1)}\|^2 < \|x_i - \mu_\ell^{(t-1)}\|^2 \text{ for all } \ell \neq k$$
Assign data to the closest cluster centers, breaking ties arbitrarily.

Mean Update Step: $\mu^{(t)} = \arg\min_\mu J(z^{(t)}, \mu \mid x, K)$
$$\mu_k^{(t)} = \frac{1}{N_k^{(t)}} \sum_{i=1}^{N} z_{ik}^{(t)} x_i, \qquad N_k^{(t)} = \sum_{i=1}^{N} z_{ik}^{(t)}$$
Means of the data assigned to each cluster center (least squares).
Undefined in sub-optimal configurations where some clusters are unused.
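A compact NumPy sketch of these two alternating steps (the function name and the handling of empty clusters are illustrative choices, not the lecture's prescription):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Alternate the assignment and mean-update steps until assignments stop changing.
    Initialization here simply picks K distinct random data points as centers."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    z = np.full(len(X), -1)
    for _ in range(max_iters):
        # Assignment step: send each point to its closest center (ties broken by argmin).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z_new = dists.argmin(axis=1)
        if np.array_equal(z_new, z):
            break                            # converged: assignments unchanged
        z = z_new
        # Mean update step: each center becomes the mean of its assigned points.
        for k in range(K):
            if np.any(z == k):               # unused clusters are left where they are
                mu[k] = X[z == k].mean(axis=0)
    return z, mu
```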
K-Means Algorithm
[Figure: panels (a)–(i) showing the cluster centers and assignments after successive assignment and mean-update steps, through convergence. C. Bishop, Pattern Recognition & Machine Learning]
Reconstruction Error versus Iteration
[Figure: the objective $J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \|x_i - \mu_k\|^2$ plotted after each assignment step and each mean-update step; the cost decreases monotonically across iterations. C. Bishop, Pattern Recognition & Machine Learning]
K-Means Implementation & Properties
$$J(z, \mu \mid x, K) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik} \|x_i - \mu_k\|^2, \qquad z_{ik} = 1 \text{ if assigned to cluster } k, \; z_{ik} = 0 \text{ otherwise}$$
$$z^{(t)} = \arg\min_z J(z, \mu^{(t-1)} \mid x, K) \qquad \mu^{(t)} = \arg\min_\mu J(z^{(t)}, \mu \mid x, K)$$
Initialization: Choose random cluster centers $\mu^{(0)}$
•  Should be distinct (breaking symmetry) and in the “region” of the data
•  Common heuristic: Randomly pick K data points
•  K-Means++: Randomly pick K widely separated data points (see the sketch below)
Theoretical Guarantees:
•  Converges after finitely many iterations ($z^{(t+1)} = z^{(t)}$)
•  Worst-case convergence time is poor (super-polynomial in N)
•  Different initializations may produce very different solutions
•  Converged objective may be arbitrarily worse than the optimum,
   but smart initializations (K-Means++) do allow some guarantees
•  In practice, can usually still find “useful” local optima
•  Optimal reconstruction error always decreases with K, reaching 0 if K = N
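The K-Means++ initialization mentioned above can be sketched as follows (a simplified version of the standard squared-distance seeding rule; the function name is illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-Means++ style seeding: the first center is a uniformly random data point;
    each further center is a data point drawn with probability proportional to its
    squared distance from the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```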
Test Error versus K
[Figure: MSE on held-out test data versus K for K-means, for data drawn from a mixture of 3 one-dimensional Gaussian distributions; the test MSE keeps decreasing as K grows.]

For compressing new data, more codewords are always better.
Cross-validation fails for unsupervised learning!
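A small experiment illustrating why held-out reconstruction error cannot select K (a sketch assuming scikit-learn, which the lecture does not specify, and synthetic 1D data resembling the plot's setup):

```python
import numpy as np
from sklearn.cluster import KMeans   # assumption: scikit-learn, not part of the lecture

# Held-out reconstruction MSE keeps shrinking as K grows.
rng = np.random.default_rng(0)
sample = lambda n: np.concatenate([rng.normal(m, 0.5, size=(n, 1)) for m in (-3.0, 0.0, 3.0)])
X_train, X_test = sample(200), sample(200)

for K in (2, 3, 5, 10, 20):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_train)
    nearest = km.cluster_centers_[km.predict(X_test)]
    print(K, np.mean(np.sum((X_test - nearest) ** 2, axis=1)))
```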
Gaussian Mixture Models
•  Observed feature vectors: $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
•  Hidden cluster labels: $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$
•  Hidden mixture means: $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$
•  Hidden mixture covariances: $\Sigma_k \in \mathbb{R}^{d \times d}$, $k = 1, 2, \ldots, K$
•  Hidden mixture probabilities: $\pi_k$, $\sum_{k=1}^{K} \pi_k = 1$
•  Gaussian mixture generative model:
$$p(z_i) = \mathrm{Cat}(z_i \mid \pi) \qquad p(x_i \mid z_i) = \mathrm{Norm}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
$$p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} \, \mathrm{Norm}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
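A minimal NumPy/SciPy sketch of this generative model and its marginal density (function names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_sample(pi, mus, Sigmas, N, seed=0):
    """Generative model: z_i ~ Cat(pi), then x_i ~ Norm(mu_{z_i}, Sigma_{z_i})."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=N, p=pi)
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return z, x

def gmm_density(x, pi, mus, Sigmas):
    """Marginal density p(x | pi, mu, Sigma): sum over cluster assignments."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mus[k], cov=Sigmas[k])
               for k in range(len(pi)))
```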
Gaussian Mixture Models
[Figure: a mixture of 3 Gaussian distributions in 2D (left), and a contour plot of the joint density, marginalizing cluster assignments (right).]
Gaussian Mixture Models
[Figure: surface plot of the joint density, marginalizing cluster assignments.]
Gaussian Discriminant Analysis
$y$: class label in {1, …, C}, observed in training
$x \in \mathbb{R}^d$: observed features to be used for classification
$$p(y, x \mid \pi, \theta) = p(y \mid \pi) \, p(x \mid y, \theta)$$
Discriminant analysis is a generative classifier: a prior distribution over labels times a likelihood function for the features.
$$p(y \mid \pi) = \mathrm{Cat}(y \mid \pi) \qquad p(x \mid y = c, \theta) = \mathcal{N}(x \mid \mu_c, \Sigma_c), \quad \theta_c = \{\mu_c, \Sigma_c\}$$
•  The joint probability distributions for Gaussian discriminant analysis and the Gaussian mixture model are identical!
•  The difference is in interpretation and usage:
   •  For discriminant analysis, class labels y are training data
   •  For mixture models, cluster labels z are never observed
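To make the generative-classifier reading concrete, a short sketch of classification with known parameters (helper name is illustrative; learning the parameters is the ML/MAP step discussed later):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_predict(x, pi, mus, Sigmas):
    """Class posterior of the generative classifier via Bayes' rule:
    p(y = c | x) is proportional to pi_c * Norm(x | mu_c, Sigma_c)."""
    scores = np.array([pi[c] * multivariate_normal.pdf(x, mean=mus[c], cov=Sigmas[c])
                       for c in range(len(pi))])
    return scores / scores.sum()
```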
Directed Acyclic Graphs (DAGs)
[Figure: example directed acyclic graph over nodes $X_1, \ldots, X_6$.]

$V$: set of N nodes or vertices, $\{1, 2, \ldots, N\}$
$E$: set of oriented edges $(s, t)$ linking parents $s$ to children $t$, so that the set of parents of a node is
$$\mathrm{pa}(t) = \{s \in V \mid (s, t) \in E\}$$
$X_s = x_s$: random variable associated with node $s$
Directed Graphical Models
Chain rule implies that any joint distribution equals:
$$p(x_1, \ldots, x_N) = \prod_{t=1}^{N} p(x_t \mid x_1, \ldots, x_{t-1})$$
Directed graphical model implies a restricted factorization:
$$p(x_1, \ldots, x_N) = \prod_{t=1}^{N} p(x_t \mid x_{\mathrm{pa}(t)})$$
nodes → random variables
pa(t) → parents with edges pointing to node t
Valid for any directed acyclic graph (DAG):
equivalent to dropping conditional dependencies in the standard chain rule.
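A toy numerical example of the restricted factorization (the 3-node DAG and its tables below are hypothetical, not the graph from the slides):

```python
import numpy as np

# p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1), all variables binary.
p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.7, 0.3],    # rows index x1, columns index x2
                          [0.2, 0.8]])
p_x3_given_x1 = np.array([[0.9, 0.1],
                          [0.5, 0.5]])

def joint(x1, x2, x3):
    """One conditional factor per node, each conditioned only on that node's parents."""
    return p_x1[x1] * p_x2_given_x1[x1, x2] * p_x3_given_x1[x1, x3]

total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
assert abs(total - 1.0) < 1e-12          # the factorization defines a valid distribution
```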
Name That Model
Naïve Bayes: $p(y, x_{1:D}) = p(y) \prod_{j=1}^{D} p(x_j \mid y)$

Shading & Plate Notation
[Figure: plate diagram for Naïve Bayes, with one class node and feature nodes $X_j$ inside a plate of size D. Plates denote replication of random variables.]

Naïve Bayes Inference: $p(y \mid x_{1:D}) \propto p(y) \prod_{j=1}^{D} p(x_j \mid y)$

Convention: Shaded nodes are observed, open nodes are latent/hidden
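A small sketch of this inference rule for binary features (a hypothetical Bernoulli instantiation; the lecture does not fix the form of p(x_j | y)):

```python
import numpy as np

def naive_bayes_posterior(x, prior, theta):
    """p(y = c | x) proportional to p(y = c) * prod_j p(x_j | y = c), for binary
    features, with theta[c, j] = p(x_j = 1 | y = c)."""
    x = np.asarray(x)
    likelihood = np.prod(theta ** x * (1.0 - theta) ** (1 - x), axis=1)
    scores = prior * likelihood
    return scores / scores.sum()
```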


Parameterization & Representation
[Figure: the example DAG over $X_1, \ldots, X_6$, annotated with one conditional probability table per node, each conditioned on that node’s parents.]

Representational (storage, learning, computation) Complexity
•  Joint distribution: Exponential in number of variables
•  Directed graphical model: Exponential in number of parents (“fan-in”) of each node, linear in number of nodes
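To make the complexity comparison concrete, a quick parameter count for binary variables (the DAG structure below is hypothetical, not necessarily the one in the figure):

```python
# Full joint table over N binary variables: 2^N - 1 free parameters.
# DAG factorization: one Bernoulli parameter per setting of each node's parents,
# i.e. the sum over nodes of 2^{|pa(t)|}.
parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}   # hypothetical DAG
N = len(parents)

full_joint_params = 2 ** N - 1
dag_params = sum(2 ** len(pa) for pa in parents.values())
print(full_joint_params, dag_params)     # 63 versus 13 for this example
```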
Gaussian Mixture Models
•  Observed feature vectors: $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
•  Hidden cluster labels: $z_i \in \{1, 2, \ldots, K\}$, $i = 1, 2, \ldots, N$
•  Hidden mixture means: $\mu_k \in \mathbb{R}^d$, $k = 1, 2, \ldots, K$
•  Hidden mixture covariances: $\Sigma_k \in \mathbb{R}^{d \times d}$, $k = 1, 2, \ldots, K$
•  Hidden mixture probabilities: $\pi_k$, $\sum_{k=1}^{K} \pi_k = 1$
•  Gaussian mixture generative model:
$$p(z_i) = \mathrm{Cat}(z_i \mid \pi) \qquad p(x_i \mid z_i) = \mathrm{Norm}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
$$p(x_i \mid \pi, \mu, \Sigma) = \sum_{z_i = 1}^{K} \pi_{z_i} \, \mathrm{Norm}(x_i \mid \mu_{z_i}, \Sigma_{z_i})$$
Supervised Learning
Generative ML or MAP Learning:
$$\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \left[ \log p(y_i \mid \pi) + \log p(x_i \mid y_i, \theta) \right]$$
[Figure: plate diagrams for the generative and discriminative models, with training instances $(y_i, x_i)$, $i = 1, \ldots, N$, held-out test instances $(y_t, x_t)$, and parameters $\pi, \theta$ shared between train and test.]
Discriminative ML or MAP Learning:
$$\max_{\theta} \; \log p(\theta) + \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta)$$
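For the Gaussian generative classifier, the learning maximization separates across classes and reduces to counting and averaging; a minimal sketch (no priors, hence ML rather than MAP; the function name is illustrative):

```python
import numpy as np

def fit_generative_gaussian(X, y, C):
    """ML estimate of the generative classifier: class frequencies for pi,
    per-class sample means and covariances for theta_c = {mu_c, Sigma_c}."""
    pi = np.array([np.mean(y == c) for c in range(C)])
    mus = np.array([X[y == c].mean(axis=0) for c in range(C)])
    Sigmas = np.array([np.cov(X[y == c].T, bias=True) for c in range(C)])
    return pi, mus, Sigmas
```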
Unsupervised Learning
Clustering:
$$\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \log \Bigl[ \sum_{z_i} p(z_i \mid \pi) \, p(x_i \mid z_i, \theta) \Bigr]$$
Dimensionality Reduction:
$$\max_{\pi, \theta} \; \log p(\pi) + \log p(\theta) + \sum_{i=1}^{N} \log \int p(z_i \mid \pi) \, p(x_i \mid z_i, \theta) \, dz_i$$
[Figure: plate diagram with hidden $z_i$, observed $x_i$, and parameters $\pi, \theta$.]
•  No notion of training and test data: labels are never observed
•  As before, maximize posterior probability of model parameters
•  For hidden variables associated with each observation, we marginalize over possible values rather than estimating them
•  Fully accounts for uncertainty in these variables
•  There is one hidden variable per observation, so we cannot estimate them perfectly even with infinite data
•  Must use a generative model (the discriminative objective degenerates)
Unsupervised Learning Algorithms
[Figure: plate diagrams for supervised training, supervised testing, and unsupervised learning, with parameters $\pi, \theta$ shared across instances and hidden variables $z_i$ unique to particular instances.]
$\pi, \theta$: parameters (shared across instances)
$z_1, \ldots, z_N$: hidden data (unique to particular instances)
•  Initialization: Randomly select starting parameters
•  Estimation: Given parameters, infer likely hidden data
   •  Similar to testing phase of supervised learning
•  Learning: Given hidden & observed data, find likely parameters
   •  Similar to training phase of supervised learning
•  Iteration: Alternate estimation & learning until convergence (see the sketch below)