
Unsupervised Clustering

Clustering is a very general problem that appears in many different settings (not necessarily in a data mining context):
Grouping similar products together to improve the efficiency of a production line
Packing similar items into containers
Grouping similar customers together
Grouping similar stocks together

The Similarity Concept


Obviously, the concept of similarity is key to clustering.
Using similarity definitions that are specific to a domain may generate more acceptable clusters.
E.g. products that require the same or similar tools/processes in the production line are similar.
Articles that are in the course pack of the same course are similar.
General similarity measures are required for general-purpose algorithms.

Clustering: The K-Means Algorithm


(Lloyd, 1982)
1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
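The steps above translate almost line-for-line into code. Below is a minimal sketch of Lloyd's algorithm using NumPy; the function name and defaults are illustrative, not part of the original slides.

import numpy as np

def kmeans(points, k, max_iter=100, seed=None):
    # Minimal Lloyd's algorithm, following the five steps above.
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 2: randomly pick k of the instances as initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of the instances assigned to it.
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 5: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels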

Distance Measure
The similarity is captured by a distance measure in this algorithm.
The originally proposed distance measure is the Euclidean distance.
X = (x_1, x_2, \ldots, x_n), \quad Y = (y_1, y_2, \ldots, y_n)

d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
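As a quick sanity check, here is the same distance computed in NumPy for instances 1 and 2 of the example that follows (Table 3.6); the snippet is illustrative only.

import numpy as np

x = np.array([1.0, 1.5])   # instance 1 from Table 3.6
y = np.array([1.0, 4.5])   # instance 2 from Table 3.6
print(np.sqrt(np.sum((x - y) ** 2)))   # 3.0, same as np.linalg.norm(x - y)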

An Example Using K-Means


Table 3.6   K-Means Input Values

Instance     X      Y
1            1.0    1.5
2            1.0    4.5
3            2.0    1.5
4            2.0    3.5
5            3.0    2.5
6            5.0    6.0

[Figure: scatter plot of the six input instances]

Table 3.7 Several Applications of the K-Means Algorithm (K = 2)


Outcome   Cluster Centers    Cluster Points     Squared Error
1         (2.67, 4.67)       2, 4, 6            14.50
          (2.00, 1.83)       1, 3, 5
2         (1.50, 1.50)       1, 3               15.94
          (2.75, 4.125)      2, 4, 5, 6
3         (1.80, 2.70)       1, 2, 3, 4, 5       9.60
          (5.00, 6.00)       6

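The squared-error column can be reproduced by summing, for each instance, the squared Euclidean distance to its cluster center. A small sketch for outcome 3; the cluster assignment comes from the table, everything else is illustrative.

import numpy as np

points = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                   [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

# Outcome 3 from Table 3.7: instances 1-5 in one cluster, instance 6 alone.
clusters = {(1.8, 2.7): [0, 1, 2, 3, 4], (5.0, 6.0): [5]}

sse = sum(np.sum((points[idx] - np.array(center)) ** 2)
          for center, idx in clusters.items())
print(round(sse, 2))   # 9.6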

General Considerations
Works best when the clusters in the data are of approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
Requires real-valued data. Categorical data can be converted into real values, but the distance function needs to be worked out carefully.
We must select the number of clusters present in the data.
Data normalization may be required if attribute ranges vary significantly.
Alternative distance measures may generate different clusters.

K-means Clustering

Partitional clustering approach


Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple

K-means Clustering Details

Initial centroids are often chosen randomly.

Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster.


Closeness is measured by Euclidean distance, cosine similarity, correlation, etc.
K-means will converge for the common similarity measures mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to 'Until relatively few points change clusters'.

Complexity is O(n * K * I * d)

n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Two different K-means Clusterings


[Figure: the same original points clustered two ways by K-means — an optimal clustering and a sub-optimal clustering]

Importance of Choosing Initial Centroids


[Figure: K-means result at Iteration 6 for a good choice of initial centroids]

Importance of Choosing Initial Centroids


[Figure: six panels showing the cluster assignments and centroid positions at Iterations 1 through 6]

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)


For each point, the error is the distance to the nearest cluster center.
To get SSE, we square these errors and sum them.
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

x is a data point in cluster C_i and m_i is the representative point for cluster C_i.
One can show that m_i corresponds to the center (mean) of the cluster.
Given two clusterings, we can choose the one with the smallest error.
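scikit-learn's KMeans reports this same quantity as inertia_. A quick sketch on the six instances from Table 3.6 (assuming scikit-learn is installed); with several random restarts it typically recovers the lowest-error outcome from Table 3.7 (SSE 9.60).

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the two centroids m_i
print(km.labels_)           # cluster index for each point
print(km.inertia_)          # SSE of the final clustering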

Importance of Choosing Initial Centroids


[Figure: K-means result at Iteration 5 for a poor choice of initial centroids]

Importance of Choosing Initial Centroids


[Figure: five panels showing the cluster assignments and centroid positions at Iterations 1 through 5 for the poor initial centroids]

Problems with Selecting Initial Points

If there are K 'real' clusters, then the chance of selecting one centroid from each cluster is small.
The chance is relatively small when K is large.
If the clusters are all of the same size, n, then
probability = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = (K! n^K) / (K n)^K = K! / K^K
For example, if K = 10, then probability = 10!/10^10 = 0.00036.
Sometimes the initial centroids will readjust themselves in the 'right' way, and sometimes they don't.
Consider an example of five pairs of clusters.
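The number quoted above is easy to verify:

from math import factorial

K = 10
print(factorial(K) / K**K)   # K!/K^K = 0.00036288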

Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree.
Can be visualized as a dendrogram: a tree-like diagram that records the sequence of merges or splits.
[Figure: a nested clustering of points and the corresponding dendrogram]

Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
They may correspond to meaningful taxonomies
Example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...)

Hierarchical Clustering
Two main types of hierarchical clustering:
Agglomerative:
Start with the points as individual clusters
At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Traditional hierarchical algorithms use a similarity or distance matrix
Merge or split one cluster at a time

Agglomerative Clustering Algorithm
More popular hierarchical clustering technique
Basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
Key operation is the computation of the proximity of two clusters
Different approaches to defining the distance between clusters distinguish the different algorithms
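In practice this loop is usually delegated to a library. A sketch with SciPy's hierarchy module on the six instances from Table 3.6; single linkage is chosen arbitrarily here, and the data and cut level are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
              [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])

Z = linkage(X, method='single')                   # repeatedly merges the two closest clusters
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
print(labels)

dendrogram(Z)                                     # record of the merge sequence
plt.show()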

Starting Situation
Start with clusters of individual points and a proximity matrix.
[Figure: points p1, p2, p3, p4, p5, ... and the initial proximity matrix, with one row and column per point]

Intermediate Situation
After some merging steps, we have some clusters
[Figure: current clusters C1-C5 and the corresponding proximity matrix]

Intermediate Situation
We want to merge the two closest clusters (C2 and C5)
and update the proximity matrix.
[Figure: clusters C1-C5 with C2 and C5 highlighted for merging, and the corresponding proximity matrix]

After Merging
The question is: how do we update the proximity matrix?
[Figure: proximity matrix after C2 and C5 are merged into C2 ∪ C5; the entries involving the new cluster are marked '?']

How to Define Inter-Cluster Similarity


[Figure: two clusters of points p1-p5, labeled 'Similarity?', and the proximity matrix]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)


Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
Determined by one pair of points, i.e., by one link in the proximity graph.
      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
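Under single link, the similarity between two clusters is the largest entry among their cross-cluster pairs. A quick sketch using the matrix above; the split into {I1, I2} and {I3, I4, I5} is only an assumed example, not from the slides.

import numpy as np

# Similarity matrix from the slide (rows/columns I1..I5).
S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

A, B = [0, 1], [2, 3, 4]        # assumed clusters {I1, I2} and {I3, I4, I5}
cross = S[np.ix_(A, B)]         # similarities between points in different clusters
print(cross.max())              # single link similarity = largest cross-pair = 0.7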

Hierarchical Clustering: MIN


[Figure: nested clusters produced by single link (MIN) and the corresponding dendrogram]

Strength of MIN
[Figure: original points and the two clusters found by single link]
Can handle non-elliptical shapes

Limitations of MIN
[Figure: original points and the two clusters found by single link]
Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage
Similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
Determined by all pairs of points in the two clusters.
      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MAX


[Figure: nested clusters produced by complete link (MAX) and the corresponding dendrogram]

Strength of MAX
[Figure: original points and the two clusters found by complete link]
Less susceptible to noise and outliers

Limitations of MAX
[Figure: original points and the two clusters found by complete link]
Tends to break large clusters
Biased towards globular clusters

Cluster Similarity: Group Average
Proximity of two clusters is the average of the pairwise proximities between points in the two clusters.

\text{proximity}(\text{Cluster}_i, \text{Cluster}_j) = \frac{\sum_{p_i \in \text{Cluster}_i} \sum_{p_j \in \text{Cluster}_j} \text{proximity}(p_i, p_j)}{|\text{Cluster}_i| \cdot |\text{Cluster}_j|}

Need to use average connectivity for scalability, since total proximity favors large clusters.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
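Continuing the same assumed split ({I1, I2} vs {I3, I4, I5}), complete link takes the smallest cross-cluster similarity and group average takes the mean of all cross-cluster pairs; a sketch:

import numpy as np

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])

A, B = [0, 1], [2, 3, 4]                  # assumed clusters {I1, I2} and {I3, I4, I5}
cross = S[np.ix_(A, B)]
print(cross.min())                        # complete link similarity: 0.1
print(cross.sum() / (len(A) * len(B)))    # group average: 2.75 / 6 ≈ 0.458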

Hierarchical Clustering: Group Average
[Figure: nested clusters produced by group average linkage and the corresponding dendrogram]

Hierarchical Clustering: Group Average
Compromise between Single and Complete Link
Strengths
Less susceptible to noise and outliers

Limitations
Biased towards globular clusters

Cluster Similarity: Ward's Method
Similarity of two clusters is based on the increase in squared error when the two clusters are merged.
Similar to group average if the distance between points is the squared distance.

Less susceptible to noise and outliers


Biased towards globular clusters
Hierarchical analogue of K-means
Can be used to initialize K-means
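The last point can be made concrete: run Ward's method, take the resulting cluster means, and hand them to K-means as initial centroids. A sketch with SciPy and scikit-learn; the data and K here are purely illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))   # illustrative data
K = 3

# Ward's method: each merge minimizes the increase in total squared error.
Z = linkage(X, method='ward')
labels = fcluster(Z, t=K, criterion='maxclust')      # labels are 1..K

# Use the Ward cluster means as the initial centroids for K-means.
init_centers = np.array([X[labels == c].mean(axis=0) for c in range(1, K + 1)])
km = KMeans(n_clusters=K, init=init_centers, n_init=1).fit(X)
print(km.inertia_)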

Hierarchical Clustering: Comparison
[Figure: the same points clustered with MIN, MAX, Group Average, and Ward's Method, side by side]

Hierarchical Clustering: Time and Space requirements

O(N^2) space, since it uses the proximity matrix (N is the number of points).
O(N^3) time in many cases:
There are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched.
Complexity can be reduced to O(N^2 log N) time for some approaches.

Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone.
No objective function is directly minimized.
Different schemes have problems with one or more of the following:
Sensitivity to noise and outliers
Difficulty handling different-sized clusters and convex shapes
Breaking large clusters
