References:
Some slides from Data Mining by Margaret Dunham; the Xu and Wunsch clustering survey on the class web page; the Bishop book; and Bilmes' gentle introduction to EM.
Clustering is unsupervised learning: there are no class labels. We want to find groups of similar instances, which requires a distance measure (usually Euclidean distance) for dissimilarity. Cluster memberships/distances can also be used as additional features.
Clustering Houses
[Example figure: houses grouped into clusters]
Clustering Problem
Given data D = {x_1, x_2, ..., x_n} of instances and an integer value k, the Clustering Problem is to define a mapping f: D → {1, ..., k} where each x_i is assigned to one cluster S_j, 1 ≤ j ≤ k. A cluster S_j contains precisely those instances mapped to it. Unlike the classification problem, the clusters are not known a priori. Clustering often uses a distance function between instances, such as Euclidean distance over R^n.
Two main approaches:
Hierarchical (agglomerative/divisive)
Partitional
Agglomerative clustering merges the closest instances/clusters, creating a tree of clusterings; cutting the tree at different levels gives different numbers of clusters.
Hierarchical Clustering
Clusters are created in levels, producing a set of clusters at each level.
Agglomerative: initially each item is in its own cluster; clusters are iteratively merged together (bottom-up).
Divisive: initially all items are in one cluster; large clusters are successively divided (top-down, picking a large cluster to split).
Dendrogram
Dendrogram: a tree data structure that illustrates hierarchical clustering techniques. Each level shows the clusters for that level: the leaves are individual items/clusters, and the root is a single all-inclusive cluster.
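For concreteness, SciPy's scipy.cluster.hierarchy module builds exactly this kind of tree; a minimal sketch with made-up data (the API calls are standard SciPy, everything else is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D items; 'single' gives single-link agglomerative clustering.
X = np.array([[0.0, 0.0], [0.0, 0.5], [1.0, 0.0], [5.0, 5.0], [5.4, 5.0]])
Z = linkage(X, method='single')   # the merge tree behind a dendrogram

# Cutting the tree at different levels gives different cluster counts.
print(fcluster(Z, t=2, criterion='maxclust'))    # force exactly 2 clusters
print(fcluster(Z, t=1.5, criterion='distance'))  # cut at merge distance 1.5
```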
Single Link
Start from the items and the distances between them. Distances below a threshold are links (edges), and the clusters are the maximal connected components of the resulting graph. Increasing the threshold adds edges and merges clusters.
[Figure: dendrogram over items A, B, C, D, E, cut at thresholds 1 through 5]
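A from-scratch sketch of this threshold-graph view, assuming Euclidean distance; the union-find helper and the name single_link are ours, not from the slides:

```python
import numpy as np

def single_link(X, threshold):
    """Distances below `threshold` are links; the clusters are the
    maximal connected components of the resulting graph."""
    n = len(X)
    parent = list(range(n))                 # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Add an edge for every pair closer than the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) < threshold:
                parent[find(i)] = find(j)   # merge their components

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Raising the threshold adds edges, so clusters merge but never split.
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.4]])
print(single_link(X, threshold=0.6))        # [0, 0, 0, 1, 1]
```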
Partitional Algorithms
Nearest Neighbor
K-Means
Gaussian mixtures (EM)
Many others
Nearest Neighbor
Items are examined one at a time and usually put into the closest cluster. A threshold t is used to determine whether an item is added to an existing cluster or a new cluster is created. The method is incremental/on-line, not iterative, and the result depends on the ordering of the items.
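A minimal sketch of this incremental scheme (the function name and data are illustrative, not from the slides):

```python
import numpy as np

def nearest_neighbor_clustering(X, t):
    """Each new item joins the cluster of its nearest already-seen
    item if that item is within distance t; otherwise it founds a
    new cluster. One pass, no iteration."""
    labels = [0]                       # the first item founds cluster 0
    n_clusters = 1
    for i in range(1, len(X)):
        dists = [np.linalg.norm(X[i] - X[j]) for j in range(i)]
        j = int(np.argmin(dists))      # nearest already-clustered item
        if dists[j] <= t:
            labels.append(labels[j])   # join its cluster
        else:
            labels.append(n_clusters)  # start a new cluster
            n_clusters += 1
    return labels

# Order dependence: presenting these items in another order can
# change the resulting clustering.
X = np.array([[0.0], [0.4], [5.0], [0.8]])
print(nearest_neighbor_clustering(X, t=1.0))   # [0, 0, 1, 0]
```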
K-means clustering
1. Pick k starting means, μ_1 to μ_k.
Can use: randomly picked examples, perturbations of the sample mean, or points equally spaced along the principal component.
2. Repeat until convergence:
a. Split the data into k sets, S_1 to S_k, where x ∈ S_i iff μ_i is the closest mean to x.
b. Update each μ_i to the mean of S_i.
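A compact sketch of this loop, initializing the means with randomly picked examples (one of the options above); the function name and defaults are ours:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k starting means: here, k randomly chosen examples.
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # 2a. Split the data into S_1..S_k by nearest mean.
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = d.argmin(axis=1)                  # cluster index per point
        # 2b. Update each mean to the mean of its set S_i.
        new_means = np.array([X[assign == i].mean(axis=0)
                              if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):          # converged
            break
        means = new_means
    return means, assign

# Two well-separated blobs recover their centers.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
means, labels = kmeans(X, k=2)
```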
Soft clustering
Each instance has a degree of membership in every cluster, e.g. P(cluster | x):

cluster   x1   x2   x3
  1       .6   .2   .4
  2       .2   .1   .3
  3       .1   .5   .1
  4       .1   .2   .2

(each column sums to 1)
Soft clustering → EM
Assume parametric forms for
P(cluster): multinomial
P(x | cluster): e.g., Gaussian
Each cluster c has a prior P(c) and parameters (μ_c, Σ_c); the soft memberships are the posteriors

P(c | x^t) = P(x^t | c) P(c) / Norm(x^t)

cluster  P(c)  params     x1   x2   x3
  1       .3   μ1, Σ1     .6   .2   .4
  2       .2   μ2, Σ2     .2   .1   .3
  3       .2   μ3, Σ3     .1   .5   .1
  4       .3   μ4, Σ4     .1   .2   .2

Iteratively:
1. Estimate the cluster latent variables based on the data and the old parameters.
2. Update the parameters to maximize the likelihood, assuming the new estimates are the truth.
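A small sketch of how one column of the table would be computed, assuming 1-D Gaussian components; all parameter values are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

# Assumed component parameters (mu_c, sigma_c) and priors P(c).
mus    = np.array([0.0, 2.0, 4.0, 6.0])
sigmas = np.array([1.0, 1.0, 1.5, 1.0])
priors = np.array([0.3, 0.2, 0.2, 0.3])    # sums to 1

def posteriors(x):
    """P(c | x) = P(x | c) P(c) / Norm(x), Norm(x) = sum over clusters."""
    joint = norm.pdf(x, mus, sigmas) * priors
    return joint / joint.sum()               # one column of the table

print(posteriors(1.0))                        # soft memberships for one x
```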
Expectation-Maximization (EM)
Log likelihood with a mixture model:

L(Φ | X) = log Π_t p(x^t | Φ) = Σ_t log Σ_{i=1..k} p(x^t | G_i) P(G_i)

Each G_i is a generative model. The log of a sum is tough to work with, so assume hidden variables z^t which, when known, make the optimization much simpler (z^t indicates which G_i generated x^t). The complete likelihood L_C(Φ | X, Z) is in terms of x and z; the incomplete likelihood L(Φ | X) is in terms of x alone.

E- and M-steps
Iterate the following two steps:
E-step: estimate the distribution of z given X and the current Φ^l.
M-step: find the new Φ given the z estimate, X, and the old Φ^l.

E-step: Q(Φ | Φ^l) = E[ L_C(Φ | X, Z) | X, Φ^l ]
M-step: Φ^(l+1) = argmax_Φ Q(Φ | Φ^l)
EM as likelihood ascent
[Figure: the log likelihood L(Φ | X) as a function of Φ; the lower bound Q(Φ | Φ^l) touches it at the current Φ^l, and its maximum gives the new Φ^(l+1).]

Complete log likelihood: L_C(Φ | X, Z) = Σ_t log p(x^t, z^t | Φ)
Q(Φ | Φ^l) = Σ_Z [ Σ_t log p(x^t, z^t | Φ) ] P(Z | X, Φ^l)
where Φ holds the parameters of the mixture proportions and of all the G_i.
EM in Gaussian Mixtures
z_i^t = 1 if x^t belongs to G_i, 0 otherwise; assume p(x | G_i) ~ N(μ_i, Σ_i).

E-step:
E[z_i^t | X, Φ^l] = p(x^t | G_i, Φ^l) P(G_i) / Σ_j p(x^t | G_j, Φ^l) P(G_j)
                  = P(G_i | x^t, Φ^l) ≡ h_i^t

M-step (N = number of instances):
P(G_i) = Σ_t h_i^t / N
m_i^(l+1) = Σ_t h_i^t x^t / Σ_t h_i^t
S_i^(l+1) = Σ_t h_i^t (x^t − m_i^(l+1))(x^t − m_i^(l+1))^T / Σ_t h_i^t
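A sketch of the E- and M-step updates above for a full-covariance Gaussian mixture; the initialization, the small ridge term, and all names are our choices, not prescribed by the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    priors = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: h[t, i] = P(G_i | x^t, current parameters)
        h = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, means[i], covs[i])
            for i in range(k)])
        h /= h.sum(axis=1, keepdims=True)

        # M-step: P(G_i) = sum_t h_i^t / N, then the mean and
        # covariance updates weighted by the responsibilities h.
        Nk = h.sum(axis=0)
        priors = Nk / n
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i]
            covs[i] += 1e-6 * np.eye(d)   # guard against degenerate S_i
    return priors, means, covs, h

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
priors, means, covs, h = em_gmm(X, k=2)
```

Per the local-optima caveat below, in practice one would run this from several seeds or initialize the means with k-means.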
Mixture components can themselves be latent-factor models: p(x^t | G_i) = N(m_i, V_i V_i^T + Ψ_i). EM can then be used to learn the factors V_i (Ghahramani and Hinton, 1997; Tipping and Bishop, 1999).
After Clustering
Dimensionality reduction methods find correlations between features and group features; clustering methods find similarities between instances and group instances. Clustering allows knowledge extraction through the number of clusters, the prior probabilities, and the cluster parameters, i.e., the center and the range of each feature.
Clustering as Preprocessing
Estimated group labels h_j (soft) or b_j (hard) may be seen as the dimensions of a new k-dimensional space in which we can then learn our discriminant or regressor. This is a local representation (only one b_j is 1, all others are 0; only a few h_j are nonzero) versus a distributed representation (as after PCA, where all z_j are nonzero).
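A sketch of this idea, reusing the em_gmm sketch from the EM section; the data and the downstream classifier are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (assumes em_gmm from the EM sketch is in scope).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.repeat([0, 1], 100)

_, _, _, h = em_gmm(X, k=4)       # soft memberships h_j, shape (200, 4)
b = np.eye(4)[h.argmax(axis=1)]   # hard one-hot labels b_j

# Either the soft code h or the hard code b is the new k-dim input.
clf = LogisticRegression().fit(h, y)
print(clf.score(h, y))
```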
Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised). If each class is also a mixture, e.g. of Gaussians (unsupervised), we have a mixture of mixtures:
p(x | C_i) = Σ_{j=1..k_i} p(x | G_ij) P(G_ij)
p(x) = Σ_{i=1..K} p(x | C_i) P(C_i)

Problems with EM
Local optima:
Try it a couple of times
Use good initialization (perhaps k-means)
EM summary
An iterative method for maximizing likelihood. It is a general method: not just for Gaussian mixtures, but also for HMMs, Bayes nets, etc. It generally works well, but can get stuck in local optima and degenerate situations. It yields both a clustering and a distribution (the mixture of Gaussians); the distributions can be used for Bayesian learning.