
Lecture Notes

Part 22
Cluster Analysis
15.075/ESD.07J
Statistical Thinking and Data Analysis
Spring 2014
M.I.T.

Roy Welsch 2014


Copyright 2014 Massachusetts Institute of Technology. All Rights Reserved.

What is Cluster Analysis?

Finding groups of objects such that the objects in a group are
similar (or related) to one another and different from (or
unrelated to) the objects in other groups: intra-cluster distances
are minimized while inter-cluster distances are maximized.

Tan, Steinbach, Kumar

Notion of a Cluster can be Ambiguous

How many clusters? The same set of points can plausibly be grouped
into two, four, or six clusters.

Tan, Steinbach, Kumar

Types of Clusterings

A clustering is a set of clusters

Important distinction between hierarchical and partitional sets
of clusters:

Partitional Clustering
A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree

Tan, Steinbach, Kumar

Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram:
a tree-like diagram that records the sequences of merges or splits
[Figure: nested clusters and the corresponding dendrogram over six
points, with merge heights on a distance scale from 0 to 0.2.]

Tan, Steinbach, Kumar
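A minimal sketch of how such a dendrogram can be produced with SciPy; the six random points, the labels, and the complete-linkage choice are illustrative assumptions, not the slide's data.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 2))        # six hypothetical points in the plane

    Z = linkage(X, method="complete")  # record of merges, closest pair first
    dendrogram(Z, labels=[str(i) for i in range(1, 7)])
    plt.ylabel("Distance")
    plt.show()

    print(fcluster(Z, t=2, criterion="maxclust"))  # cut the tree into 2 clusters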

Clustering Algorithms
Hierarchical techniques.
Optimization techniques.


Strengths of Hierarchical Clustering

Do not have to assume any particular number of clusters
Any desired number of clusters can be obtained by cutting the
dendrogram at the proper level

They may correspond to meaningful taxonomies
Example in the biological sciences (e.g., animal kingdom,
phylogeny reconstruction, ...)

Tan, Steinbach, Kumar

Hierarchical Clustering

Two main types of hierarchical clustering:

Agglomerative:
Start with the points as individual clusters
At each step, merge the closest pair of clusters until only one
cluster (or k clusters) is left

Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a single
point (or there are k clusters)

Traditional hierarchical algorithms use a similarity or distance
matrix and merge or split one cluster at a time

Tan, Steinbach, Kumar


Hierarchical Techniques

Agglomerative hierarchical techniques are more common.
The most popular agglomerative techniques:
1. Minimum distance (also called single linkage)
2. Maximum distance (also called complete linkage)
3. Group average (also called average linkage)

Minimum Distance (Single Linkage)


The distance between two clusters is defined as the distance
between the nearest pair of objects, with each object in the pair
belonging to a distinct cluster.
If cluster A is the set of objects a_1, a_2, ..., a_m and cluster B
is b_1, b_2, ..., b_n, the single linkage distance between A and B is

    d(A, B) = min { distance(a_i, b_j) : i = 1, ..., m; j = 1, ..., n }.

Tends to create string-shaped clusters.


Maximum Distance (Complete Linkage)


The distance between two clusters is defined as the distance
between the farthest pair of objects, with each object in the pair
belonging to a distinct cluster.
If cluster A is the set of objects a_1, a_2, ..., a_m and cluster B
is b_1, b_2, ..., b_n, the complete linkage distance between A and B is

    d(A, B) = max { distance(a_i, b_j) : i = 1, ..., m; j = 1, ..., n }.

Tends to create compact, spherical clusters.

Average Distance (Average Linkage)


Here the distance between two clusters is defined as the average
distance between all possible pairs of objects, with each object
in the pair belonging to a distinct cluster.
If cluster A is the set of objects a_1, a_2, ..., a_m and cluster B
is b_1, b_2, ..., b_n, the average linkage distance between A and B is

    d(A, B) = (1 / mn) * sum of distance(a_i, b_j),

the sum being taken over i = 1, ..., m and j = 1, ..., n.
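The three linkage rules can be checked with a few lines of NumPy; the two tiny clusters below are hypothetical.

    import numpy as np

    A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A: a_1, ..., a_m (hypothetical)
    B = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster B: b_1, ..., b_n (hypothetical)

    # m x n matrix of pairwise Euclidean distances d(a_i, b_j)
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

    print("single linkage  :", D.min())      # nearest pair
    print("complete linkage:", D.max())      # farthest pair
    print("average linkage :", D.mean())     # (1/mn) * sum over all pairs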


Hierarchical Clustering

This produces a binary tree or dendrogram.
The final cluster is the root and each data item is a leaf.
The heights of the bars indicate how close the items are.

[Figure: dendrogram with distance on the vertical axis and the
data items (genes, etc.) as leaves.]

2006 C. Burge

An Illustration: Public Utilities Data


Aim: To predict the cost impact of deregulation.
This would require building a detailed cost model of the various
utilities.
A considerable amount of time and effort would be saved if we
could cluster similar types of utilities and build detailed cost
models for just one typical utility in each cluster.
Models can then be scaled up to estimate results for all utilities.

The Data

[Table: the utilities data set, not reproduced here.]

Explanation of Variables

[Table: definitions of the measurements, not reproduced here.]

Example of Distance Matrix


      1     2     3     4     5
1   0.0   3.1   3.7   2.5   4.1
2   3.1   0.0   4.9   2.2   3.9
3   3.7   4.9   0.0   4.1   4.5
4   2.5   2.2   4.1   0.0   4.1
5   4.1   3.9   4.5   4.1   0.0
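A sketch of how such a matrix is computed in practice; the five two-dimensional records below are made up, not the utilities data.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Hypothetical records; in the utilities example these would be
    # (standardized) measurements for each utility.
    X = np.array([[0.2, 1.1], [1.5, 0.3], [-1.0, 2.0], [1.0, 1.0], [2.5, -1.0]])

    D = squareform(pdist(X, metric="euclidean"))  # symmetric, zero diagonal
    print(np.round(D, 1))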


Dendrogram (Complete linkage)


[Figure C: Dendrogram: Complete Linkage for All 22 Utilities,
Using All 8 Measurements. Distance is on the vertical axis;
utility numbers label the leaves.]

Sales & Fuel Cost:


Three rough clusters can be seen:
High fuel cost, low sales
Low fuel cost, high sales
Low fuel cost, low sales

Relational representation of gene expression data

          samp_1  samp_2  samp_3  samp_4  ...  samp_m   Label
gene_1    x_11    x_12    x_13    x_14    ...  x_1m
gene_2    x_21    x_22    x_23    x_24    ...  x_2m
gene_3    x_31    x_32    x_33    x_34    ...  x_3m
...
gene_n    x_n1    x_n2    x_n3    x_n4    ...  x_nm     P

n > m

Why cluster?
Cluster genes (rows)
Measure expression at multiple time-points, different conditions, etc.
Similar expression patterns may suggest similar functions of genes

Cluster samples (columns)
e.g., expression levels of thousands of genes for each tumor sample
Similar expression patterns may suggest biological relationships
among samples

2006 C. Burge
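Both directions come down to whether the rows or the columns of the expression matrix are clustered; a sketch with simulated data (the sizes and the average-linkage choice are assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))    # rows = genes (n = 100), columns = samples (m = 8)

    Z_genes = linkage(X, method="average")       # cluster the rows (genes)
    Z_samples = linkage(X.T, method="average")   # cluster the columns (samples)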


Validating Clusters
Interpretability
Summary statistics
Common features not used to cluster
Assign a label?
Cluster stability (over partitions)
Cluster partition A
Assign cases in partition B to clusters from A with the closest
centroid
Assess consistency based on clusters obtained from all of the data
A type of cross-validation, as sketched below

Limitations of Hierarchical Clustering


Computational cost: the n × n distance matrix is needed
Only one pass through the data: records incorrectly allocated
early cannot be relocated later
Poor stability
Single and complete linkage are robust to the choice of distance
metric; average linkage is not, and may give rather different
clusters
Sensitive to outliers (means?)

Optimization Methods
A non-hierarchical approach to forming good clusters.
Specify a desired number of clusters, say k, and assign each case
to one of the k clusters so as to minimize a measure of dispersion
within the clusters.
A common measure is the sum of squared Euclidean distances from
the mean of each cluster.
The optimization problem is difficult.
In practice, clusters are often computed using fast, heuristic
methods that generally produce good (but not necessarily optimal)
solutions. A very popular (non-hierarchical) method is the k-Means
algorithm.

K-Means
Greedy, local improvement heuristic for minimizing within-cluster
squared Euclidean distances
Starting clusters required
Convergence guaranteed, but only to a local minimum
Very fast; in practice requires few iterations
Many variations
Not scale invariant, often needs normalization (see the sketch below)
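A sketch of the usual remedy for the scale problem: standardize each column to z-scores before clustering (the variable scales below are invented):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.column_stack([rng.normal(0, 1, 100),        # small-scale variable
                         rng.normal(0, 1000, 100)])    # large-scale variable

    Xz = StandardScaler().fit_transform(X)             # mean 0, sd 1 per column
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xz)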


k-Means Clustering Algorithm


Start with k initial centers (how to choose k?).
At every step, each record is re-assigned to the cluster with the
closest centroid.
Recompute the centroids of clusters that lost or gained a record.
Stop when moving any more records between clusters increases
cluster dispersion.
Randomize starts? (These steps are sketched below.)
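A minimal from-scratch sketch of these steps; the random data, k = 2, and the simple stopping rule are assumptions for illustration, not a reference implementation.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # k initial centers
        for _ in range(n_iter):
            # re-assign each record to the cluster with the closest centroid
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # recompute centroids of clusters (keep old center if a cluster empties)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):   # stop when nothing moves
                break
            centroids = new
        return labels, centroids

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in (0.0, 3.0)])
    labels, centroids = kmeans(X, k=2)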

k-Means Algorithm (contd.)


Generally the number of clusters in the data is not known.
A good idea is to run the algorithm with different values of k
near the number of clusters one expects from the data, to see how
the sum of distances decreases with increasing k.
The ratio of the sum of distances for a given k to the sum of
distances to the mean of all the cases (k = 1) is a good measure
of the usefulness of the clustering.
If the ratio is near 1.0, the clustering has not been very
effective; if the ratio is small, we have well-separated groups.
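A sketch of this ratio using scikit-learn, whose inertia_ attribute is the within-cluster sum of squared distances (the three simulated blobs are an assumption):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0.0, 4.0, 8.0)])

    base = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X).inertia_  # k = 1
    for k in range(2, 7):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        print(k, round(inertia / base, 3))   # small ratio -> well-separated groups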


Problems with Selecting Initial Points

If there are K real clusters, then the chance of selecting one
initial centroid from each cluster is small.
The chance is especially small when K is large.
If the clusters are all the same size, n, then

    P = (K! n^K) / (Kn)^K = K! / K^K

For example, if K = 10, then the probability is 10!/10^10 ≈ 0.00036.

Sometimes the initial centroids will readjust themselves in the
right way, and sometimes they don't.
Tan, Steinbach, Kumar
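The arithmetic above is easy to verify (a tiny sketch; any K can be substituted):

    from math import factorial

    for K in (2, 5, 10):
        print(K, factorial(K) / K**K)   # K = 10 gives 10!/10^10 ≈ 0.00036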


Solutions to Initial Centroids Problem

Multiple runs
Helps, but probability is not on your side

Sample the data and use hierarchical clustering to determine
initial centroids

Select more than k initial centroids and then select among these
initial centroids
Select the most widely separated (see the sketch below)

Tan, Steinbach, Kumar
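In scikit-learn, the first and last remedies roughly correspond to n_init (multiple restarts, best run kept) and init="k-means++" (a seeding rule that tends to spread the initial centroids apart); a short sketch with invented data:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.4, size=(30, 2)) for c in (0, 3, 6, 9)])

    km = KMeans(n_clusters=4, init="k-means++", n_init=20, random_state=0).fit(X)
    print(km.inertia_)   # the best of the 20 runs is kept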


How do we define similarity?


The goal is to group together similar data, but how do we define
similarity (or distance) between points or clusters?
In general, it depends on what we want to find or emphasize in
the data; clustering is an art.
The similarity measure is often more important than the
clustering algorithm used.

2006 C. Burge

Similarity Measures
Sometimes it is more natural or convenient to work with a
similarity measure between cases rather than a distance, which
measures dissimilarity.
Such similarity measures can always be converted to dissimilarity
measures.
An example of a similarity measure is the square of a correlation
coefficient (Pearson or Spearman).
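For instance, squared Pearson correlation can serve as the similarity and 1 - r^2 as the corresponding dissimilarity (the two cases below are invented):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    r = np.corrcoef(x, y)[0, 1]
    similarity = r**2                    # near 1 when the cases move together
    dissimilarity = 1.0 - similarity     # a dissimilarity in [0, 1]
    print(similarity, dissimilarity)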


Similarity Measures (Binary data)


Suppose we have binary values for all the x_ij's, and for
individuals i and j we have the following 2×2 table of counts
(a counts 0-0 matches, b and c count mismatches, d counts 1-1
matches):

                      Individual j
                       0       1
    Individual i   0   a       b
                   1   c       d        p = a + b + c + d

The most useful similarity measures in this situation are:

The matching coefficient, (a + d) / p
Jaccard's coefficient, d / (b + c + d). This coefficient ignores
zero matches. This is desirable when we do not want to consider
two individuals to be similar when a zero indicates a feature
that is not important.
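A short check of the two coefficients on two invented binary cases, using the table's convention (a = 0-0 matches, d = 1-1 matches):

    import numpy as np

    xi = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # individual i (hypothetical)
    xj = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # individual j (hypothetical)

    a = np.sum((xi == 0) & (xj == 0))   # 0-0 matches
    b = np.sum((xi == 0) & (xj == 1))   # mismatches
    c = np.sum((xi == 1) & (xj == 0))   # mismatches
    d = np.sum((xi == 1) & (xj == 1))   # 1-1 matches
    p = a + b + c + d

    print("matching coefficient:", (a + d) / p)      # 0.75 here
    print("Jaccard coefficient :", d / (b + c + d))  # 0.6, ignores 0-0 matches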
