References:
Some slides from Data Mining by Margaret Dunham; the Xu and Wunsch clustering survey on the class web page; the Bishop book; and Bilmes' gentle introduction to EM.
Clustering is unsupervised learning: there are no class labels. We want to find groups of similar instances, which requires a distance measure (usually Euclidean distance) for dissimilarity. Cluster memberships/distances can also be used as additional features.
Clustering Houses
[Example figure: houses grouped into clusters]
Clustering Problem
Given data D = {x_1, x_2, ..., x_n} of instances and an integer value k, the Clustering Problem is to define a mapping f: D → {1, ..., k} where each x_i is assigned to one cluster S_j, 1 ≤ j ≤ k. A cluster S_j contains precisely those instances mapped to it. Unlike the classification problem, the clusters are not known a priori. Clustering often uses a distance function between instances, such as Euclidean distance over R^n.
Two main approaches:
Hierarchical (agglomerative/divisive)
Partitional
Agglomerative clustering merges the closest instances/clusters, creating a tree of clusterings; cutting the tree at different levels gives different numbers of clusters.
Hierarchical Clustering
Clusters are created in levels, producing a set of clusters at each level.
Agglomerative: initially each item is in its own cluster; clusters are iteratively merged together (bottom-up).
Divisive: initially all items are in one cluster; large clusters are successively divided (top-down, picking a large cluster to split).
Dendrogram
Dendrogram: a tree data structure that illustrates hierarchical clustering techniques. Each level shows the clusters for that level: the leaves are individual items/clusters, and the root is a single all-inclusive cluster.
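For concreteness, SciPy's scipy.cluster.hierarchy module builds exactly this kind of tree; a minimal sketch with made-up data (the API calls are standard SciPy, everything else is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D items; 'single' gives single-link agglomerative clustering.
X = np.array([[0.0, 0.0], [0.0, 0.5], [1.0, 0.0], [5.0, 5.0], [5.4, 5.0]])
Z = linkage(X, method='single')   # the merge tree behind a dendrogram

# Cutting the tree at different levels gives different cluster counts.
print(fcluster(Z, t=2, criterion='maxclust'))    # force exactly 2 clusters
print(fcluster(Z, t=1.5, criterion='distance'))  # cut at merge distance 1.5
```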
Single Link
Start from the items and the distances between them. Distances below a threshold are links (edges), and the clusters are the maximal connected components of the resulting graph. Increasing the threshold adds edges and merges clusters.
[Figure: dendrogram over items A, B, C, D, E, cut at thresholds 1 through 5]
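A from-scratch sketch of this threshold-graph view, assuming Euclidean distance; the union-find helper and the name single_link are ours, not from the slides:

```python
import numpy as np

def single_link(X, threshold):
    """Distances below `threshold` are links; the clusters are the
    maximal connected components of the resulting graph."""
    n = len(X)
    parent = list(range(n))                 # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Add an edge for every pair closer than the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) < threshold:
                parent[find(i)] = find(j)   # merge their components

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

# Raising the threshold adds edges, so clusters merge but never split.
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.4]])
print(single_link(X, threshold=0.6))        # [0, 0, 0, 1, 1]
```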
Partitional Algorithms
Nearest Neighbor
K-Means
Gaussian mixtures (EM)
Many others
Nearest Neighbor
Items are examined one at a time and usually put into the closest cluster. A threshold t is used to determine whether an item is added to an existing cluster or a new cluster is created. The method is incremental/on-line, not iterative, and the result depends on the ordering of the items.
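A minimal sketch of this incremental scheme (the function name and data are illustrative, not from the slides):

```python
import numpy as np

def nearest_neighbor_clustering(X, t):
    """Each new item joins the cluster of its nearest already-seen
    item if that item is within distance t; otherwise it founds a
    new cluster. One pass, no iteration."""
    labels = [0]                       # the first item founds cluster 0
    n_clusters = 1
    for i in range(1, len(X)):
        dists = [np.linalg.norm(X[i] - X[j]) for j in range(i)]
        j = int(np.argmin(dists))      # nearest already-clustered item
        if dists[j] <= t:
            labels.append(labels[j])   # join its cluster
        else:
            labels.append(n_clusters)  # start a new cluster
            n_clusters += 1
    return labels

# Order dependence: presenting these items in another order can
# change the resulting clustering.
X = np.array([[0.0], [0.4], [5.0], [0.8]])
print(nearest_neighbor_clustering(X, t=1.0))   # [0, 0, 1, 0]
```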
K-means clustering
1. Pick k starting means, μ_1 to μ_k.
Can use: randomly picked examples, perturbations of the sample mean, or points equally spaced along the principal component.
2. Repeat until convergence:
a. Split the data into k sets, S_1 to S_k, where x ∈ S_i iff μ_i is the closest mean to x.
b. Update each μ_i to the mean of S_i.
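A compact sketch of this loop, initializing the means with randomly picked examples (one of the options above); the function name and defaults are ours:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k starting means: here, k randomly chosen examples.
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # 2a. Split the data into S_1..S_k by nearest mean.
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = d.argmin(axis=1)                  # cluster index per point
        # 2b. Update each mean to the mean of its set S_i.
        new_means = np.array([X[assign == i].mean(axis=0)
                              if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):          # converged
            break
        means = new_means
    return means, assign

# Two well-separated blobs recover their centers.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
means, labels = kmeans(X, k=2)
```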
Soft clustering
Each instance has a degree of membership in every cluster, e.g. P(cluster | x):

cluster   x1   x2   x3
  1       .6   .2   .4
  2       .2   .1   .3
  3       .1   .5   .1
  4       .1   .2   .2

(each column sums to 1)
Soft clustering → EM
Assume parametric forms for
P(cluster): multinomial
P(x | cluster): e.g., Gaussian
Each cluster c has a prior P(c) and parameters (μ_c, Σ_c); the soft memberships are the posteriors

P(c | x^t) = P(x^t | c) P(c) / Norm(x^t)

cluster  P(c)  params     x1   x2   x3
  1       .3   μ1, Σ1     .6   .2   .4
  2       .2   μ2, Σ2     .2   .1   .3
  3       .2   μ3, Σ3     .1   .5   .1
  4       .3   μ4, Σ4     .1   .2   .2

Iteratively:
1. Estimate the cluster latent variables based on the data and the old parameters.
2. Update the parameters to maximize the likelihood, assuming the new estimates are the truth.
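A small sketch of how one column of the table would be computed, assuming 1-D Gaussian components; all parameter values are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

# Assumed component parameters (mu_c, sigma_c) and priors P(c).
mus    = np.array([0.0, 2.0, 4.0, 6.0])
sigmas = np.array([1.0, 1.0, 1.5, 1.0])
priors = np.array([0.3, 0.2, 0.2, 0.3])    # sums to 1

def posteriors(x):
    """P(c | x) = P(x | c) P(c) / Norm(x), Norm(x) = sum over clusters."""
    joint = norm.pdf(x, mus, sigmas) * priors
    return joint / joint.sum()               # one column of the table

print(posteriors(1.0))                        # soft memberships for one x
```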
Expectation-Maximization (EM)
Log likelihood with a mixture model:

L(Φ | X) = log Π_t p(x^t | Φ) = Σ_t log Σ_{i=1..k} p(x^t | G_i) P(G_i)

Each G_i is a generative model. The log of a sum is tough to work with, so assume hidden variables z^t which, when known, make the optimization much simpler (z^t indicates which G_i generated x^t). The complete likelihood L_C(Φ | X, Z) is in terms of x and z; the incomplete likelihood L(Φ | X) is in terms of x alone.

E- and M-steps
Iterate the following two steps:
E-step: estimate the distribution of z given X and the current Φ^l.
M-step: find the new Φ given the z estimate, X, and the old Φ^l.

E-step: Q(Φ | Φ^l) = E[ L_C(Φ | X, Z) | X, Φ^l ]
M-step: Φ^(l+1) = argmax_Φ Q(Φ | Φ^l)
EM as likelihood ascent
[Figure: the log likelihood L(Φ | X) as a function of Φ; the lower bound Q(Φ | Φ^l) touches it at the current Φ^l, and its maximum gives the new Φ^(l+1).]

Complete log likelihood: L_C(Φ | X, Z) = Σ_t log p(x^t, z^t | Φ)
Q(Φ | Φ^l) = Σ_Z [ Σ_t log p(x^t, z^t | Φ) ] P(Z | X, Φ^l)
where Φ holds the parameters of the mixture proportions and of all the G_i.
EM in Gaussian Mixtures
z_i^t = 1 if x^t belongs to G_i, 0 otherwise; assume p(x | G_i) ~ N(μ_i, Σ_i).

E-step:
E[z_i^t | X, Φ^l] = p(x^t | G_i, Φ^l) P(G_i) / Σ_j p(x^t | G_j, Φ^l) P(G_j)
                  = P(G_i | x^t, Φ^l) ≡ h_i^t

M-step (N = number of instances):
P(G_i) = Σ_t h_i^t / N
m_i^(l+1) = Σ_t h_i^t x^t / Σ_t h_i^t
S_i^(l+1) = Σ_t h_i^t (x^t − m_i^(l+1))(x^t − m_i^(l+1))^T / Σ_t h_i^t
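A sketch of the E- and M-step updates above for a full-covariance Gaussian mixture; the initialization, the small ridge term, and all names are our choices, not prescribed by the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, size=k, replace=False)].astype(float)
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    priors = np.full(k, 1.0 / k)

    for _ in range(n_iters):
        # E-step: h[t, i] = P(G_i | x^t, current parameters)
        h = np.column_stack([
            priors[i] * multivariate_normal.pdf(X, means[i], covs[i])
            for i in range(k)])
        h /= h.sum(axis=1, keepdims=True)

        # M-step: P(G_i) = sum_t h_i^t / N, then the mean and
        # covariance updates weighted by the responsibilities h.
        Nk = h.sum(axis=0)
        priors = Nk / n
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i]
            covs[i] += 1e-6 * np.eye(d)   # guard against degenerate S_i
    return priors, means, covs, h

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
priors, means, covs, h = em_gmm(X, k=2)
```

Per the local-optima caveat below, in practice one would run this from several seeds or initialize the means with k-means.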
Mixture components can themselves be latent-factor models: p(x^t | G_i) = N(m_i, V_i V_i^T + Ψ_i). EM can then be used to learn the factors V_i (Ghahramani and Hinton, 1997; Tipping and Bishop, 1999).
After Clustering
Dimensionality reduction methods find correlations between features and group features; clustering methods find similarities between instances and group instances. Clustering allows knowledge extraction through the number of clusters, the prior probabilities, and the cluster parameters, i.e., the center and the range of each feature.
Clustering as Preprocessing
Estimated group labels h_j (soft) or b_j (hard) may be seen as the dimensions of a new k-dimensional space in which we can then learn our discriminant or regressor. This is a local representation (only one b_j is 1, all others are 0; only a few h_j are nonzero) versus a distributed representation (as after PCA, where all z_j are nonzero).
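A sketch of this idea, reusing the em_gmm sketch from the EM section; the data and the downstream classifier are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (assumes em_gmm from the EM sketch is in scope).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.repeat([0, 1], 100)

_, _, _, h = em_gmm(X, k=4)       # soft memberships h_j, shape (200, 4)
b = np.eye(4)[h.argmax(axis=1)]   # hard one-hot labels b_j

# Either the soft code h or the hard code b is the new k-dim input.
clf = LogisticRegression().fit(h, y)
print(clf.score(h, y))
```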
Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised). If each class is also a mixture, e.g. of Gaussians (unsupervised), we have a mixture of mixtures:
p(x | C_i) = Σ_{j=1..k_i} p(x | G_ij) P(G_ij)
p(x) = Σ_{i=1..K} p(x | C_i) P(C_i)

Problems with EM
Local optima:
Try it a couple of times
Use good initialization (perhaps k-means)
EM summary
An iterative method for maximizing likelihood. It is a general method: not just for Gaussian mixtures, but also for HMMs, Bayes nets, etc. It generally works well, but can get stuck in local optima and degenerate situations. It yields both a clustering and a distribution (the mixture of Gaussians); the distributions can be used for Bayesian learning.