
Unsupervised Clustering

Clustering is a very general problem that appears in many different settings (not necessarily in a data mining context):
Grouping similar products together to improve the efficiency of a production line
Packing similar items into containers
Grouping similar customers together
Grouping similar stocks together

The Similarity Concept


Obviously, the concept of similarity is key to clustering.
Using similarity definitions that are specific to a domain may generate more acceptable clusters.
E.g. products that require the same or similar tools/processes in the production line are similar.
Articles that are in the course pack of the same course are similar.
General similarity measures are required for general-purpose algorithms.

Clustering: The K-Means Algorithm


(Lloyd, 1982)
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
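The steps above translate almost line-for-line into code. Below is a minimal sketch of Lloyd's algorithm using NumPy; the function name and defaults are illustrative, not part of the original slides.

import numpy as np

def kmeans(points, k, max_iter=100, seed=None):
    # Minimal Lloyd's algorithm, following the five steps above.
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 2: randomly pick k of the instances as initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of the instances assigned to it.
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 5: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels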

Distance Measure
The similarity is captured by a distance measure in this algorithm.
The originally proposed distance measure is the Euclidean distance.
X = (x_1, x_2, \ldots, x_n), \quad Y = (y_1, y_2, \ldots, y_n)

d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
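As a quick sanity check, here is the same distance computed in NumPy for instances 1 and 2 of the example that follows (Table 3.6); the snippet is illustrative only.

import numpy as np

x = np.array([1.0, 1.5])   # instance 1 from Table 3.6
y = np.array([1.0, 4.5])   # instance 2 from Table 3.6
print(np.sqrt(np.sum((x - y) ** 2)))   # 3.0, same as np.linalg.norm(x - y)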

An Example Using K-Means


Table 3.6   K-Means Input Values

Instance     X      Y
1            1.0    1.5
2            1.0    4.5
3            2.0    1.5
4            2.0    3.5
5            3.0    2.5
6            5.0    6.0

[Figure: scatter plot of the six input instances]

Table 3.7 Several Applications of the K-Means Algorithm (K = 2)


Outcome   Cluster Centers    Cluster Points     Squared Error
1         (2.67, 4.67)       2, 4, 6            14.50
          (2.00, 1.83)       1, 3, 5
2         (1.50, 1.50)       1, 3               15.94
          (2.75, 4.125)      2, 4, 5, 6
3         (1.80, 2.70)       1, 2, 3, 4, 5       9.60
          (5.00, 6.00)       6

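The squared-error column can be reproduced by summing, for each instance, the squared Euclidean distance to its cluster center. A small sketch for outcome 3; the cluster assignment comes from the table, everything else is illustrative.

import numpy as np

points = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                   [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

# Outcome 3 from Table 3.7: instances 1-5 in one cluster, instance 6 alone.
clusters = {(1.8, 2.7): [0, 1, 2, 3, 4], (5.0, 6.0): [5]}

sse = sum(np.sum((points[idx] - np.array(center)) ** 2)
          for center, idx in clusters.items())
print(round(sse, 2))   # 9.6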

General Considerations
Works best when the clusters in the data are of approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
Requires real-valued data. Categorical data can be converted into real values, but the distance function needs to be worked out carefully.
We must select the number of clusters present in the data.
Data normalization may be required if attribute ranges vary significantly.
Alternative distance measures may generate different clusters.

K-means Clustering

Partitional clustering approach


Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple

K-means Clustering Details

Initial centroids are often chosen randomly.

Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster.


Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to 'Until relatively few points change clusters'.

Complexity is O(n * K * I * d)

n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Two different K-means Clusterings


[Figure: the same original points clustered two ways by K-means — an optimal clustering and a sub-optimal clustering]

Importance of Choosing Initial Centroids


[Figure: K-means result at Iteration 6 for a good choice of initial centroids]

Importance of Choosing Initial Centroids


[Figure: six panels showing the cluster assignments and centroid positions at Iterations 1 through 6]

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)


For each point, the error is the distance to the nearest cluster center.
To get SSE, we square these errors and sum them.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

x is a data point in cluster C_i and m_i is the representative point for cluster C_i.
One can show that m_i corresponds to the center (mean) of the cluster.
Given two clusterings, we can choose the one with the smallest error.
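scikit-learn's KMeans reports this same quantity as inertia_. A quick sketch on the six instances from Table 3.6 (assuming scikit-learn is installed); with several random restarts it typically recovers the lowest-error outcome from Table 3.7 (SSE 9.60).

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the two centroids m_i
print(km.labels_)           # cluster index for each point
print(km.inertia_)          # SSE of the final clustering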

Importance of Choosing Initial Centroids


[Figure: K-means result at Iteration 5 for a poor choice of initial centroids]

Importance of Choosing Initial Centroids


[Figure: five panels showing the cluster assignments and centroid positions at Iterations 1 through 5 for the poor initial centroids]

Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
The chance is relatively small when K is large.
If the clusters are all of the same size, n, then
probability = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = (K! n^K) / (K n)^K = K! / K^K
For example, if K = 10, then probability = 10!/10^10 = 0.00036.
Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't.
Consider an example of five pairs of clusters.
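The number quoted above is easy to verify:

from math import factorial

K = 10
print(factorial(K) / K**K)   # K!/K^K = 0.00036288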

Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
[Figure: a nested clustering of points and the corresponding dendrogram]

Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
They may correspond to meaningful taxonomies
Example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...)

Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative:
Start with the points as individual clusters
At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix
Merge or split one cluster at a time

Agglomerative Clustering Algorithm
More popular hierarchical clustering technique
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
Key operation is the computation of the proximity of two clusters
Different approaches to defining the distance between clusters distinguish the different algorithms
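In practice this loop is usually delegated to a library. A sketch with SciPy's hierarchy module on the six instances from Table 3.6; single linkage is chosen arbitrarily here, and the data and cut level are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

Z = linkage(X, method='single')                   # repeatedly merges the two closest clusters
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
print(labels)

dendrogram(Z)                                     # record of the merge sequence
plt.show()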

Starting Situation
Start with clusters of individual points and a proximity matrix.
[Figure: points p1, p2, p3, p4, p5, ... and the initial proximity matrix, with one row and column per point]

Intermediate Situation
After some merging steps, we have some clusters
[Figure: current clusters C1-C5 and the corresponding proximity matrix]

Intermediate Situation
We want to merge the two closest clusters (C2 and C5)
and update the proximity matrix.
[Figure: clusters C1-C5 with C2 and C5 highlighted for merging, and the corresponding proximity matrix]

After Merging
The question is: how do we update the proximity matrix?
[Figure: proximity matrix after C2 and C5 are merged into C2 ∪ C5; the entries involving the new cluster are marked '?']

How to Define Inter-Cluster Similarity


[Figure: two clusters of points p1-p5, labeled 'Similarity?', and the proximity matrix]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)


Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
Determined by one pair of points, i.e., by one link in the proximity graph.
      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
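Under single link, the similarity between two clusters is the largest entry among their cross-cluster pairs. A quick sketch using the matrix above; the split into {I1, I2} and {I3, I4, I5} is only an assumed example, not from the slides.

import numpy as np

# Similarity matrix from the slide (rows/columns I1..I5).
S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

A, B = [0, 1], [2, 3, 4]        # assumed clusters {I1, I2} and {I3, I4, I5}
cross = S[np.ix_(A, B)]         # similarities between points in different clusters
print(cross.max())              # single link similarity = largest cross-pair = 0.7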

Hierarchical Clustering: MIN


[Figure: nested clusters produced by single link (MIN) and the corresponding dendrogram]

Strength of MIN
[Figure: original points and the two clusters found by single link]
Can handle non-elliptical shapes

Limitations of MIN
[Figure: original points and the two clusters found by single link]
Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
Determined by all pairs of points in the two clusters.
      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MAX


[Figure: nested clusters produced by complete link (MAX) and the corresponding dendrogram]

Strength of MAX
[Figure: original points and the two clusters found by complete link]
Less susceptible to noise and outliers

Limitations of MAX
[Figure: original points and the two clusters found by complete link]
Tends to break large clusters
Biased towards globular clusters

Cluster Similarity: Group Average
Proximity of two clusters is the average of the pairwise proximities between points in the two clusters.

\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i} \sum_{p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \cdot |\text{Cluster}_j|}

Need to use average connectivity for scalability, since total proximity favors large clusters.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
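Continuing the same assumed split ({I1, I2} vs {I3, I4, I5}), complete link takes the smallest cross-cluster similarity and group average takes the mean of all cross-cluster pairs; a sketch:

import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

A, B = [0, 1], [2, 3, 4]                  # assumed clusters {I1, I2} and {I3, I4, I5}
cross = S[np.ix_(A, B)]
print(cross.min())                        # complete link similarity: 0.1
print(cross.sum() / (len(A) * len(B)))    # group average: 2.75 / 6 ≈ 0.458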

Hierarchical Clustering: Group Average
[Figure: nested clusters produced by group average linkage and the corresponding dendrogram]

Hierarchical Clustering: Group Average
Compromise between Single and Complete Link
Strengths
Less susceptible to noise and outliers

Limitations
Biased towards globular clusters

Cluster Similarity: Ward's Method
Similarity of two clusters is based on the increase in squared error when the two clusters are merged.
Similar to group average if the distance between points is the squared distance.

Less susceptible to noise and outliers


Biased towards globular clusters
Hierarchical analogue of K-means
Can be used to initialize K-means
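The last point can be made concrete: run Ward's method, take the resulting cluster means, and hand them to K-means as initial centroids. A sketch with SciPy and scikit-learn; the data and K here are purely illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))   # illustrative data
K = 3

# Ward's method: each merge minimizes the increase in total squared error.
Z = linkage(X, method='ward')
labels = fcluster(Z, t=K, criterion='maxclust')      # labels are 1..K

# Use the Ward cluster means as the initial centroids for K-means.
init_centers = np.array([X[labels == c].mean(axis=0) for c in range(1, K + 1)])
km = KMeans(n_clusters=K, init=init_centers, n_init=1).fit(X)
print(km.inertia_)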

Hierarchical Clustering: Comparison
[Figure: the same points clustered with MIN, MAX, Group Average, and Ward's Method, side by side]

Hierarchical Clustering: Time and Space requirements

O(N^2) space, since it uses the proximity matrix (N is the number of points).
O(N^3) time in many cases:
There are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched.
Complexity can be reduced to O(N^2 log N) time for some approaches.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Difficulty handling different-sized clusters and convex shapes
Breaking large clusters
