
K-Means Clustering


K-means clustering aims to partition n observations into k sets.

Objective Function

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{X_j \in S_i} \lVert X_j - m_i \rVert^2$$

where $m_i$ is the mean of the points in $S_i$.

Observation set: $(X_1, X_2, \dots, X_n)$

Partition the n observations into k sets $S = \{S_1, S_2, \dots, S_k\}$, with k < n.

Example: n = 11, k = 3 (S1, S2, S3)
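
To make the objective concrete, the following minimal sketch evaluates the within-cluster sum of squares for a given partition. The function names and the six 2-D points are illustrative assumptions, not part of the original slides.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def objective(clusters):
    """Sum over all sets S_i of the squared distances ||X_j - m_i||^2."""
    total = 0.0
    for cluster in clusters:
        m = mean_point(cluster)  # m_i, the mean of the points in S_i
        total += sum(squared_distance(x, m) for x in cluster)
    return total


# Hypothetical data: two candidate partitions of the same six points.
good = [[(0, 0), (0, 2), (2, 0)], [(10, 10), (10, 12), (12, 10)]]
bad = [[(0, 0), (0, 2), (10, 10)], [(2, 0), (10, 12), (12, 10)]]
print(objective(good), objective(bad))
```

The partition that keeps nearby points together gives the smaller value, which is what the arg min over S selects.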

http://en.wikipedia.org/wiki/K-means_clustering, 30 March 2010


K-Means Clustering

Basic idea
• proposed by Hugo Steinhaus in 1956
Standard Algorithm
• proposed by Stuart Lloyd in 1957
• for a pulse-code modulation technique
The term “K-means”
• proposed by James MacQueen in 1967



Standard Algorithm (k-means algorithm, Lloyd’s algorithm)

Initialization:

initial set of k means: $m_1^{(1)}, \dots, m_k^{(1)}$
(selected by a random or heuristic method)

Assignment:

assign each object to the cluster with the closest mean

Update:

calculate the new means
-> centroid of the objects in the cluster

Repeat until stable
-> no objects change groups



Standard Algorithm

1. Select k points as the initial centroids of the groups.

2. Assign each object to the group with the closest centroid.

3. When all objects have been assigned, update the k centroids.

4. Repeat steps 2 and 3 until the centroids no longer move or the objects no longer move to other groups.
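
The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the choice of random data points as initial centroids, and the handling of empty groups are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: select k data points as the initial centroids (assumed heuristic).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        new_labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop when no object moves to another group.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: update the k centroids from the current assignment.
        for i in range(k):
            members = [p for p, g in zip(points, labels) if g == i]
            if members:  # assumption: an empty group keeps its old centroid
                centroids[i] = centroid(members)
    return labels, centroids


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Comparing the assignments between iterations covers the stop condition in step 4, because unchanged assignments also leave the centroids unchanged.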

Examples

[Figure panels: initial positions, groups by initial positions, 1st step, 2nd step, 3rd step, final step]

K-means interactive demo, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html, 30 March 2010


Examples-Matlab

n=20, k=3

Operation flow
1. Select initial centroid (random)
2. Calculate Euclidean distance
3. Assign group (find minimum distance)
4. Calculate position of new centroid
5. Calculate stop condition

[Figure panels: initial positions & grouping, 1st step, 2nd step, final step]

Matlab Statistics Toolbox: IDX = KMEANS(X, K)
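
Steps 2, 3, and 5 of the operation flow can be illustrated with a small distance table. The sketch below is written in Python rather than Matlab; the helper names and the sample points and centroids are hypothetical and do not come from the slides.

```python
import math


def distance_table(points, centroids):
    """Rows are points, columns are centroids; entries are Euclidean distances."""
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c))) for c in centroids]
        for p in points
    ]


def assign_groups(table):
    """Flow step 3: index of the minimum distance in each row."""
    return [row.index(min(row)) for row in table]


def has_converged(old_groups, new_groups):
    """Flow step 5: stop when no object changed group."""
    return old_groups == new_groups


points = [(10, 10), (20, 15), (80, 90), (85, 80)]  # hypothetical data
centroids = [(15, 12), (82, 85)]                   # hypothetical centroids
table = distance_table(points, centroids)
groups = assign_groups(table)
print(groups)  # prints [0, 0, 1, 1]
```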
Summary

K-means clustering
• is a fast and simple algorithm
• for solving clustering problems
But the algorithm
• does not necessarily find the optimal configuration
• because the result depends on the initialization
• which is chosen by random or heuristic selection
And so the k-means algorithm
• can be run multiple times
• to reduce this effect.
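
One way to run the algorithm multiple times is sketched below. It assumes that the run with the lowest objective value is kept, which the slides do not state explicitly. The names `lloyd` and `best_of`, the `restarts` and `seed` parameters, and the four-point data set are illustrative.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, k, rng, max_iter=100):
    """One k-means run from a random start; returns (objective, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, g in zip(points, labels) if g == i]
            if members:  # assumption: an empty group keeps its old centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(p, centroids[g]) for p, g in zip(points, labels))
    return cost, labels


def best_of(points, k, restarts=10, seed=0):
    """Repeat the run with different random starts and keep the lowest objective."""
    rng = random.Random(seed)
    runs = [lloyd(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[0])


# Hypothetical data: a wide rectangle, where a poor start can split the
# points top/bottom instead of left/right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
cost, labels = best_of(points, k=2)
print(cost, labels)
```

On this rectangle, a start with both centroids on the same short side converges to the top/bottom split, which is stable but has a much higher objective than the left/right split. Repeating the run makes it unlikely that every start is a poor one.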

References

Joaquin Perez Ortega, Ma. Del Rocio Boone Rojas, and Maria J.
Somodevilla Garcia, “Research issues on K-means Algorithm:
An Experimental Trial Using Matlab”, Proceedings of the 2nd
Workshop on Semantic Web and New Technologies (SemWeb09),
Puebla, Mexico, March 23-24, 2009.