
Clustering

References:
  Some slides from Data Mining by Margaret Dunham
  Xu and Wunsch clustering survey (on the class web page)
  Bishop's book and Bilmes' gentle introduction to EM

Clustering is unsupervised learning: there are no class labels
Want to find groups of similar instances
Need a distance measure (usually Euclidean distance) for dissimilarity (see the short sketch below)
Can use cluster membership/distances as additional features
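As a concrete illustration of the dissimilarity step, here is a minimal Python/NumPy sketch; the helper name euclidean_distance_matrix and the toy points are made up for the example, not part of the slides.

import numpy as np

def euclidean_distance_matrix(X):
    # X: (n, d) array of instances; returns the (n, n) matrix of pairwise distances
    diffs = X[:, None, :] - X[None, :, :]      # (n, n, d) pairwise differences
    return np.sqrt((diffs ** 2).sum(axis=-1))  # Euclidean norm of each difference

# Example: 4 instances in 2-D
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
D = euclidean_distance_matrix(X)
print(np.round(D, 2))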

Clustering Houses

Clustering Problem
Given data D = {x1, x2, ..., xn} of instances and an integer value k, the Clustering Problem is to define a mapping f : D -> {1, ..., k} where each xi is assigned to one cluster Sj, 1 <= j <= k. A cluster Sj contains precisely those instances mapped to it. Unlike the classification problem, the clusters are not known a priori. Often uses a distance function between instances, such as Euclidean distance over R^n.

(Figure: the same houses clustered two ways, size based and geographic distance based.)



2 main approaches:

Hierarchical (agglomerative/divisive)
  Agglomerative merges the closest instances/clusters
  Creates a tree of clusterings
  Cut at different levels to get different numbers of clusters

Iterative, partitional (k-means, expectation-maximization)
  Often must pick the number of clusters k in advance (AIC, BIC, or Tibshirani's gap statistic can help)
  Usually iterate to improve the clustering

Hierarchical Clustering
Clusters are created in levels, giving a set of clusters at each level.
Agglomerative
  Initially each item is in its own cluster
  Iteratively, clusters are merged together
  Bottom up
Divisive
  Initially all items are in one cluster
  Large clusters are successively divided
  Top down; pick a large cluster to split

Dendrogram
Dendrogram: a tree data structure which illustrates hierarchical clustering techniques. Each level shows clusters for that level.
Leaf = individual clusters; Root = one all-inclusive cluster
A cluster at level i is the union of its child clusters at level i+1.

Single Link
Items with distances between them
Distances below a threshold are links (edges)
Clusters are the maximal connected components
Increasing the threshold adds edges and merges clusters (see the sketch after the example below)


Single Link Example


     A  B  C  D  E
A    0  1  2  2  3
B    1  0  2  4  3
C    2  2  0  1  5
D    2  4  1  0  3
E    3  3  5  3  0

Cluster Merge Criteria


Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
Centroid: distance between centroids
MST: use only distances in a min-cost spanning tree

(Dendrogram for the example: leaves A, B, C, D, E, with merge thresholds 1 through 5 on the vertical axis.)
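The single-link description above can be implemented almost literally: put an edge between every pair whose distance is at or below the threshold and report the connected components. A minimal Python sketch using the example distance matrix; the function name and the threshold loop are mine, not from the slides.

import numpy as np

# Distance matrix from the example (points A..E)
D = np.array([
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
], dtype=float)

def single_link_clusters(D, threshold):
    """Clusters = connected components of the graph whose edges are pairs with distance <= threshold."""
    n = len(D)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # traverse every point linked (directly or indirectly) to 'start'
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and D[i, j] <= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

for t in [1, 2, 3]:
    print("threshold", t, "->", single_link_clusters(D, t))
# threshold 1 gives {A,B}, {C,D}, {E}; raising it merges clusters, as described above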

Partitional Algorithms
Nearest Neighbor
K-Means
Gaussian mixtures (EM)
Many others

Nearest Neighbor
Items are examined one at a time and usually put into the closest cluster
A threshold, t, is used to determine whether an item is added to an existing cluster or a new cluster is created
Incremental/on-line, not iterative
Depends on the ordering of the items
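A minimal sketch of this incremental scheme, assuming Euclidean distance; the function name and toy data are my own, since the slides do not prescribe an implementation.

import numpy as np

def nearest_neighbour_clustering(X, t):
    """Each item joins the cluster of its nearest already-seen item if that
    distance is <= t, otherwise it starts a new cluster (order dependent)."""
    labels = []
    for i, x in enumerate(X):
        if i == 0:
            labels.append(0)
            continue
        dists = np.linalg.norm(np.asarray(X[:i]) - np.asarray(x), axis=1)  # distances to earlier items
        j = int(np.argmin(dists))
        if dists[j] <= t:
            labels.append(labels[j])          # join the nearest item's cluster
        else:
            labels.append(max(labels) + 1)    # open a new cluster
    return labels

X = [[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.2, 4.9]]
print(nearest_neighbour_clustering(X, t=1.0))   # e.g. [0, 0, 1, 1]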

K-means clustering
1. Pick k starting means, μ1 to μk
   Can use: randomly picked examples, perturbations of the sample mean, or points equally spaced along the principal component

2. Repeat until convergence:
   1. Split the data into k sets, S1 to Sk, where x ∈ Si iff μi is the closest mean to x
   2. Update each μi to the mean of Si

Tabular view of k-means

Hard membership matrix M (rows = clusters 1-4, columns = instances):

      x1  x2  x3
  1    1   0   1
  2    0   0   0
  3    0   1   0
  4    0   0   0
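A minimal NumPy sketch of this two-step loop; the initialization here uses randomly picked examples (one of the options listed above), and the function name, iteration cap, and toy data are my own.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick k starting means: here, randomly picked examples
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2a. Assign each x to the set S_i of its closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=-1)  # (n, k)
        assign = dists.argmin(axis=1)
        # 2b. Update each mean mu_i to the mean of S_i (keep the old mean if S_i is empty)
        new_means = np.array([X[assign == i].mean(axis=0) if np.any(assign == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):   # converged
            break
        means = new_means
    return means, assign

# Toy data: two well-separated blobs
X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(5, 0.5, (50, 2))])
means, assign = kmeans(X, k=2)
print(np.round(means, 2))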


Soft clustering
Soft membership matrix M (rows = clusters 1-4, columns = instances; entries are P(cluster | x), so each column sums to 1):

      x1  x2  x3
  1   .6  .2  .4
  2   .2  .1  .3
  3   .1  .5  .1
  4   .1  .2  .2

From Soft clustering to EM


Use a weighted mean based on the soft-clustering weights
Soft cluster values are probabilities: P(cluster | x)
Uses Bayes rule: P(cluster | x) is proportional to P(x | cluster) P(cluster)
For each x, the true cluster for x is a latent (unobserved) variable
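A tiny sketch of the first point, computing cluster means weighted by the soft memberships. The membership matrix is the one from the table above; the 2-D instances are invented for illustration.

import numpy as np

# Soft memberships P(cluster | x): rows = clusters 1-4, columns = x1..x3
W = np.array([[.6, .2, .4],
              [.2, .1, .3],
              [.1, .5, .1],
              [.1, .2, .2]])

# Hypothetical 2-D instances x1, x2, x3 (not from the slides)
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [4.0, 4.0]])

# Weighted mean of cluster i: sum_t W[i, t] * x_t / sum_t W[i, t]
means = (W @ X) / W.sum(axis=1, keepdims=True)
print(np.round(means, 2))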


Soft clustering to EM 2
Assume parametric forms for
  P(cluster)       (multinomial)
  P(x | cluster)   (e.g., Gaussian)

Each soft membership is then P(c | x1) = P(x1 | c) P(c) / Norm(x1), and similarly for x2, x3:

  c   P(c)   P(x | c)    x1   x2   x3
  1    .3    μ1, Σ1      .6   .2   .4
  2    .2    μ2, Σ2      .2   .1   .3
  3    .2    μ3, Σ3      .1   .5   .1
  4    .3    μ4, Σ4      .1   .2   .2

Iteratively:
1. Estimate the cluster latent variables based on the data and the old parameters
2. Update the parameters to maximize the likelihood, assuming the new estimates are the truth

This is the EM algorithm; for (lots) more see http://citeseer.ist.psu.edu/bilmes98gentle.html

Expectation-Maximization (EM)
Log likelihood with a mixture model:

  L(Φ | X) = log ∏_t p(x^t | Φ)
           = Σ_t log Σ_{i=1..k} p(x^t | G_i) P(G_i)

Each G_i is a generative model; the log of a sum is tough to work with.
Assume hidden variables z^t which, when known, make the optimization much simpler (they say which G_i produced x^t).
Complete likelihood, Lc(Φ | X, Z), is in terms of x and z; incomplete likelihood, L(Φ | X), is in terms of x only.

E- and M-steps
Iterate the following two steps:
  E-step: Estimate the distribution for Z given X and the current Φ:
          Q(Φ | Φ^l) = E[ Lc(Φ | X, Z) | X, Φ^l ]
  M-step: Find the new Φ given the Z estimate, X, and the old Φ:
          Φ^{l+1} = argmax_Φ Q(Φ | Φ^l)

An increase in Q increases the incomplete likelihood: L(Φ^{l+1} | X) ≥ L(Φ^l | X)
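Because the log of a sum is awkward to evaluate directly, implementations usually compute this incomplete log likelihood with a log-sum-exp. A small sketch for spherical Gaussian components; the function name, the spherical assumption, and the toy numbers are mine, not from the slides.

import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(X, means, sigmas, priors):
    """L(Phi | X) = sum_t log sum_i p(x^t | G_i) P(G_i), for spherical Gaussians."""
    n, d = X.shape
    log_terms = []
    for m, s, p in zip(means, sigmas, priors):
        # log N(x | m, s^2 I) + log P(G_i), for every x
        sq = ((X - m) ** 2).sum(axis=1)
        log_pdf = -0.5 * sq / s**2 - 0.5 * d * np.log(2 * np.pi * s**2)
        log_terms.append(log_pdf + np.log(p))
    return logsumexp(np.column_stack(log_terms), axis=1).sum()

X = np.array([[0.0, 0.0], [0.2, -0.1], [5.0, 5.0]])
print(mixture_log_likelihood(X,
                             means=[np.zeros(2), np.full(2, 5.0)],
                             sigmas=[1.0, 1.0],
                             priors=[0.5, 0.5]))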

EM as likelihood ascent
(Figure: the likelihood as a function of the parameters, with the current Q curve; maximizing Q at the current parameters moves to new parameters with higher likelihood.)

EM in Gaussian Mixtures
z_i^t = 1 if x^t came from G_i, 0 otherwise; z^t is the vector of all z_i^t
Assume p(x | G_i) ~ N(μ_i, Σ_i)
Φ: the parameters of the mixture and of all the G_i

  Lc(Φ | X, Z) = Σ_t log p(x^t, z^t | Φ)
  Q(Φ^new | Φ^old) = Σ_Z [ Σ_t log p(x^t, z^t | Φ^new) ] P(Z | X, Φ^old)

E-step: Q(Φ | Φ^l) = E[ Lc(Φ | X, Z) | X, Φ^l ]
M-step: Φ^{l+1} = argmax_Φ Q(Φ | Φ^l)

EM in Gaussian Mixtures
z_i^t = 1 if x^t belongs to G_i, 0 otherwise
Assume p(x | G_i) ~ N(μ_i, Σ_i)
Use the estimated labels h_i^t in place of the unknown labels z_i^t.

E-step:
  E[z_i^t | X, Φ^l] = p(x^t | G_i, Φ^l) P(G_i) / Σ_j p(x^t | G_j, Φ^l) P(G_j)
                    = P(G_i | x^t, Φ^l) ≡ h_i^t

M-step:
  P(G_i)    = Σ_t h_i^t / N
  m_i^{l+1} = Σ_t h_i^t x^t / Σ_t h_i^t
  S_i^{l+1} = Σ_t h_i^t (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T / Σ_t h_i^t
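Putting the two steps together, here is a minimal NumPy sketch of EM for a Gaussian mixture following the updates above. The initialization, iteration count, and the small ridge (reg) added to each covariance to avoid degeneracy are my own choices, not part of the slides.

import numpy as np

def em_gmm(X, k, n_iters=50, reg=1e-6, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialize: means from random examples, shared covariance, uniform priors
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + reg * np.eye(d)] * k)
    priors = np.full(k, 1.0 / k)

    def log_gauss(X, m, S):
        # log N(x | m, S) for every row of X
        diff = X - m
        _, logdet = np.linalg.slogdet(S)
        sol = np.linalg.solve(S, diff.T).T
        return -0.5 * ((diff * sol).sum(axis=1) + d * np.log(2 * np.pi) + logdet)

    for _ in range(n_iters):
        # E-step: h_i^t = p(x^t | G_i) P(G_i) / sum_j p(x^t | G_j) P(G_j)
        log_r = np.column_stack([log_gauss(X, means[i], covs[i]) + np.log(priors[i])
                                 for i in range(k)])
        log_r -= log_r.max(axis=1, keepdims=True)
        h = np.exp(log_r)
        h /= h.sum(axis=1, keepdims=True)             # (n, k), rows sum to 1

        # M-step: P(G_i) = sum_t h_i^t / N; means and covariances are h-weighted
        Nk = h.sum(axis=0)
        priors = Nk / n
        means = (h.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - means[i]
            covs[i] = (h[:, i, None] * diff).T @ diff / Nk[i] + reg * np.eye(d)
    return priors, means, covs, h

X = np.vstack([np.random.default_rng(1).normal(0, 1, (100, 2)),
               np.random.default_rng(2).normal(6, 1, (100, 2))])
priors, means, covs, h = em_gmm(X, k=2)
print(np.round(priors, 2), np.round(means, 1))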

Mixtures of Latent Variable Models


Regularize the clusters
  Assume shared/diagonal covariance matrices
  Use PCA/FA to decrease dimensionality: mixtures of PCA/FA
(Figure: fitted mixture components, with the contour where P(G1 | x) = h1 = 0.5.)

p(x^t | G_i) = N(m_i, V_i V_i^T + Ψ_i)

Can use EM to learn factors Vi (Ghahramani and Hinton, 1997; Tipping and Bishop, 1999)
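A tiny sketch of evaluating such a component density, whose covariance has the low-rank-plus-diagonal form V_i V_i^T + Ψ_i; the dimensions and random values below are made up for illustration.

import numpy as np
from scipy.stats import multivariate_normal

d, q = 5, 2                                       # data dimension, number of factors
rng = np.random.default_rng(0)
V_i = rng.normal(size=(d, q))                     # factor loadings for component i
Psi_i = np.diag(rng.uniform(0.1, 0.5, size=d))    # diagonal noise covariance

cov_i = V_i @ V_i.T + Psi_i                       # p(x | G_i) = N(m_i, V_i V_i^T + Psi_i)
m_i = np.zeros(d)
x = rng.normal(size=d)
print(multivariate_normal(mean=m_i, cov=cov_i).logpdf(x))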

After Clustering
Dimensionality reduction methods find correlations between features and group features
Clustering methods find similarities between instances and group instances
Allows knowledge extraction through the number of clusters, the prior probabilities, and the cluster parameters, e.g., centers and the range of each feature

Clustering as Preprocessing
Estimated group labels h_j (soft) or b_j (hard) may be seen as the dimensions of a new k-dimensional space, where we can then learn our discriminant or regressor.
Local representation (only one b_j is 1, all others are 0; only a few h_j are nonzero) vs. distributed representation (after PCA, all z_j are nonzero).
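A short sketch of this preprocessing idea: map each instance to its k soft memberships and train a classifier on that new representation. The use of scikit-learn's GaussianMixture and LogisticRegression, the toy data, and k = 4 are my own choices for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy labelled data: two classes, each a blob
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Soft group labels h_j = P(G_j | x) become a new k-dimensional representation
gm = GaussianMixture(n_components=4, random_state=0).fit(X)
H = gm.predict_proba(X)                  # (n, k) soft memberships

clf = LogisticRegression().fit(H, y)     # learn the discriminant in the new space
print(clf.score(H, y))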

Mixture of Mixtures
In classification, the input comes from a mixture of classes (supervised). If each class is also a mixture, e.g., of Gaussians (unsupervised), we have a mixture of mixtures:
  p(x | C_i) = Σ_{j=1..k_i} p(x | G_ij) P(G_ij)
  p(x)       = Σ_{i=1..K} p(x | C_i) P(C_i)
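A small sketch of evaluating this two-level density with 1-D Gaussian components; the class priors, component weights, and parameters below are invented for the example.

import numpy as np
from scipy.stats import norm

# Per-class mixtures: each class C_i is a mixture of 1-D Gaussians G_ij
classes = [
    {"P_C": 0.6, "weights": [0.5, 0.5], "means": [0.0, 2.0], "stds": [1.0, 0.5]},
    {"P_C": 0.4, "weights": [1.0],      "means": [6.0],      "stds": [1.0]},
]

def p_x_given_class(x, c):
    # p(x | C_i) = sum_j p(x | G_ij) P(G_ij)
    return sum(w * norm(m, s).pdf(x) for w, m, s in zip(c["weights"], c["means"], c["stds"]))

def p_x(x):
    # p(x) = sum_i p(x | C_i) P(C_i)
    return sum(c["P_C"] * p_x_given_class(x, c) for c in classes)

print(p_x(1.0), p_x(6.0))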

Problems with EM
Local minima
  Try it a couple of times
  Use a good initialization (perhaps k-means)
Degenerate Gaussians: as a covariance goes to zero, the likelihood goes to infinity
  Fix a lower bound on the covariance
Lots of parameters to learn
  Use spherical Gaussians or shared covariance matrices (or even fixed distributions)
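A tiny sketch of the last two fixes: flooring variances so no Gaussian can collapse, and averaging per-cluster covariances into a single shared one. The floor constant and function names are assumptions, not from the slides.

import numpy as np

VAR_FLOOR = 1e-3   # assumed lower bound on each variance

def floor_covariance(S, floor=VAR_FLOOR):
    """Keep a covariance matrix from collapsing by flooring its diagonal."""
    S = S.copy()
    diag = np.arange(len(S))
    S[diag, diag] = np.maximum(S[diag, diag], floor)
    return S

def shared_covariance(covs, weights):
    """Replace per-cluster covariances with a single weighted average (fewer parameters)."""
    weights = np.asarray(weights, dtype=float)
    return np.tensordot(weights / weights.sum(), np.asarray(covs), axes=1)

S = np.array([[1e-8, 0.0], [0.0, 2.0]])
print(floor_covariance(S))
print(shared_covariance([S, np.eye(2)], weights=[1, 3]))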



EM summary
Iterative method for maximizing likelihood
General method: not just for Gaussian mixtures, but also HMMs, Bayes nets, etc.
Generally works well, but can have local minima and degenerate situations
Gets both clustering and distribution (mixture of Gaussians); the distributions can be used for Bayesian learning
