
Data Mining and Analytics:

Clustering
Anirban Mondal
anirban.mondal@snu.edu.in

What we are going to cover


What is clustering and why do we need to do
clustering?
Important Clustering algorithms
And finally, you will be given different
application scenarios
And you will need to figure out which clustering
algorithm is good for a given application scenario
Note: This requires you to understand not only the
steps of each algorithm, but also its limitations,
assumptions, applicability, etc.

What is clustering and why do we need to do clustering?

Clustering
Clustering is about grouping similar objects/items
together
Within a cluster, items are more similar to each other than
with items outside the cluster

What does similar mean?


Objects are similar in some way i.e., based on some
property
If you change the property, the resulting clusters may
change too
Similarity is quantified by a distance function
Two or more objects belong to the same cluster if they are
close according to the chosen distance
Dissimilarity/similarity metric: the smaller the distance
between two objects, the more similar they are

Clustering
Clustering is one of the most important
unsupervised learning processes
Clustering finds structures in a collection of
unlabeled data
A separate quality function measures how
good the clustering is

Clustering
Input: a collection of n objects each
represented by a vector
Objective: to divide these n objects into k
clusters so that similar objects are grouped
together
In real-world settings, k is usually unknown

Example

Assume three people John, Jack and Alice


City of residence: John (NY), Jack (LA), Alice (NY)
Age: John (52), Jack (25), Alice (35)
Weight: John (70 kg), Jack (73 kg), Alice (50 kg)
Salary: John ($150 K), Jack ($40 K), Alice ($90 K)
Interests: John (music), Jack (karate), Alice (music)

Based on city of residence, what should be the clusters?


{John, Alice} and {Jack}

Based on age, what should be the clusters?


{John}, {Alice}, {Jack}

Based on weight, what should be the clusters?


{John, Jack} and {Alice}

Based on salary, what should be the clusters?


{John}, {Jack}, {Alice}

Based on interests, what should be the clusters?


{John, Alice} and {Jack}

Learning point: The resulting clusters depend on the property you cluster on
The property on which you cluster is called
the dimension
In the example, the dimensions were interests, salary, age,
city of residence, etc.
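
As a minimal sketch of this learning point, the snippet below groups the three people by one chosen dimension. The dictionary keys and the helper name `cluster_by` are illustrative assumptions, not part of the slides. Grouping by exact equality is only sensible for categorical dimensions such as city or interests; numeric dimensions such as weight would need a distance-based notion of "close".

```python
from collections import defaultdict

# The three people from the example (key names are illustrative assumptions).
people = {
    "John":  {"city": "NY", "age": 52, "interest": "music"},
    "Jack":  {"city": "LA", "age": 25, "interest": "karate"},
    "Alice": {"city": "NY", "age": 35, "interest": "music"},
}

def cluster_by(dimension):
    """Group people whose value on the chosen dimension is identical."""
    groups = defaultdict(list)
    for name, attributes in people.items():
        groups[attributes[dimension]].append(name)
    return list(groups.values())

by_city = cluster_by("city")
by_interest = cluster_by("interest")
by_age = cluster_by("age")
print(by_city, by_interest, by_age)
```

Changing the argument of `cluster_by` changes the clusters, even though the data itself is unchanged.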

More learning points


The results of clustering are generally
application-dependent, and depend on how you
intend to use these results
Example: Refer once again to the example
Age: John (52), Jack (25), Alice (35)
Here, all three people could be in different market
segments
OR {Jack, Alice} could form a cluster
It all depends on the purpose of the clustering (in
this case, could be targeted marketing)

Think scalability
In this example, you could do the clustering
manually because the dataset was very small
What if you had to cluster 1 million people, or
even 10,000 people, based on any one
dimension such as age range, interests, etc.?
Clustering algorithms are needed to achieve this

The K-Means Clustering algorithm

The K-means clustering algorithm


Inputs: The number K of clusters and the dataset to
cluster
Step 1: Randomly select K points as the initial cluster
centers/means
Step 2: Assign each point in the dataset to the closest
cluster (each point is assigned to exactly one cluster),
based upon the Euclidean distance between the point and
each cluster center (minimum distance)

Step 3: Recompute each cluster center as the average
of the points in that cluster
Repeat Steps 2 and 3 until the clusters
stabilize/converge.
Stabilize usually means that when Steps 2 and 3 are
repeated, no changes occur in the clustering results
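
The steps above can be sketched in plain Python as follows. This is one possible reading of the algorithm, not a reference implementation. The function names, the `max_iters` cap and the handling of empty clusters are assumptions. To keep the run reproducible, the initial centers are passed in explicitly instead of being drawn at random (Step 1 could use `random.sample(points, K)`). The four points are those implied by the distance calculations in the worked example, with p1 and p2 as initial centers.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, initial_centers, max_iters=100):
    """Run K-means from the given initial centers; return (assignments, centers)."""
    centers = [tuple(float(v) for v in c) for c in initial_centers]
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign every point to the single closest center.
        new_assignments = [
            min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # clusters have stabilized
        assignments = new_assignments
        # Step 3: recompute each center as the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments, centers

points = [(1, 1), (2, 1), (4, 3), (5, 4)]  # p1..p4
assignments, centers = kmeans(points, initial_centers=[points[0], points[1]])
print(assignments, centers)
```

Here cluster index 0 plays the role of the red cluster and index 1 the blue cluster; the run ends with p1 and p2 in one cluster and p3 and p4 in the other.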

Example with K = 2
ID   X   Y
1    1   1
2    2   1
3    4   3
4    5   4

X and Y are the two attributes on which you
want to do the clustering

Example with K = 2
K = 2. The points in red (p1) and blue (p2) are randomly selected as
the two initial cluster centres

Example
Compute the Euclidean distance of point 3 from both points 1 and 2

Example
d(p1, p3) = SQRT((4-1)^2 + (3-1)^2) = SQRT(13)
d(p2, p3) = SQRT((4-2)^2 + (3-1)^2) = SQRT(8)

Example
Point 3 falls into the blue cluster because its distance to p2
is less than its distance to p1 (SQRT(8) < SQRT(13))
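
The assignment of point 3 can be checked numerically. The sketch below uses the coordinates that appear in the distance formulas; the variable and function names are illustrative assumptions.

```python
import math

p1, p2, p3 = (1, 1), (2, 1), (4, 3)

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_red = euclidean(p1, p3)   # distance to the red centre, SQRT(13)
d_blue = euclidean(p2, p3)  # distance to the blue centre, SQRT(8)

# Step 2 assigns the point to whichever centre is closer.
cluster_of_p3 = "blue" if d_blue < d_red else "red"
print(d_red, d_blue, cluster_of_p3)
```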

Example
Similarly, point 4 falls into the blue cluster
because its distance to p2 is less than its
distance to p1

Example
Now recompute each cluster's centre

Example
The red cluster's centre is the same as p1
because the red cluster has only 1 point. In this
diagram, centres are indicated by a cross

Example
The blue cluster's centre is the average
of the 3 points in blue.
In this diagram, cluster centres are indicated
by a cross
Centre = ((2+4+5)/3, (1+3+4)/3), i.e., (11/3, 8/3)
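
The centre calculation can be reproduced with exact fractions, as a small sketch. The helper name `centroid` is an assumption for illustration.

```python
from fractions import Fraction

blue_cluster = [(2, 1), (4, 3), (5, 4)]  # p2, p3, p4 after the first assignment

def centroid(points):
    """Per-attribute arithmetic mean of the points, kept as exact fractions."""
    n = len(points)
    return tuple(Fraction(sum(axis), n) for axis in zip(*points))

blue_centre = centroid(blue_cluster)
red_centre = centroid([(1, 1)])  # a one-point cluster is its own centre
print(blue_centre, red_centre)
```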

Example
Now repeat steps 2 and 3:
Step 2: Assign each point in the dataset to
the closest cluster

Example
Step 3: Recompute each cluster center as
the average of the points in that cluster. In
this diagram, each cluster centre is
indicated by an X

Example
Now repeat steps 2 and 3 again
What do you see?
The clustering does not change, hence the
algorithm terminates (stabilizes)

Final result: {p1, p2} AND {p3, p4}

Pros and cons of the K-means clustering algorithm
Pros
Simple
Guaranteed to converge, though only to a local optimum

Cons
The number K of clusters needs to be provided as an input, hence K
needs to be decided in advance
When the dataset is relatively small, the initial clustering assignment has
a significant influence on the final clustering results
The same dataset can produce different clusters, depending upon the
order of input
Each attribute is given the same weight, hence we cannot figure
out which attribute contributes how much to the clustering process
Observe that the algorithm essentially uses the average (arithmetic mean)
The arithmetic mean does not work well with outliers (the median can be
used instead if outliers are a significant issue)

Manhattan distance as similarity measure


In the example for the k-means algorithm, we
used the Euclidean distance as the similarity
measure
However, other distance measures such as the
Manhattan distance measure can also be used
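The formula for the Manhattan distance between two points
a = (a1, ..., an) and b = (b1, ..., bn) is as follows:
d(a, b) = |a1 - b1| + |a2 - b2| + ... + |an - bn|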

Geometric explanation of Manhattan distance

Distance travelled from one point to another if you
follow a grid-like path, i.e., only along axis-aligned
directions
Contrast this with Euclidean distance, which measures the
ordinary distance between two points, as measured with a ruler
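
To make the contrast concrete, the sketch below computes both distances between points 1 and 3 of the K-means example. The function names are illustrative assumptions.

```python
import math

def manhattan(a, b):
    """Sum of absolute per-attribute differences (grid-like path)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line ("ruler") distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

p1, p3 = (1, 1), (4, 3)
d_manhattan = manhattan(p1, p3)  # |4-1| + |3-1|
d_euclidean = euclidean(p1, p3)  # SQRT(13)
print(d_manhattan, d_euclidean)
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, since the grid-like path cannot be shorter than the straight line.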

How good is the clustering?


One possible measure of how well the cluster
centres represent the members of their
clusters is the residual sum of squares
(RSS)
RSS = the squared distance of each vector from
its cluster centre summed over all vectors

Observe that the overall aim of the k-means
algorithm is to minimize the RSS
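
As a rough illustration, the RSS can be computed for two clusterings of the four example points: the one after the first assignment pass and the final one. The function name `rss` and the list-of-clusters representation are assumptions made for this sketch.

```python
def rss(clusters):
    """Sum, over all points, of the squared distance to their cluster's mean."""
    total = 0.0
    for members in clusters:
        centre = tuple(sum(axis) / len(members) for axis in zip(*members))
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, centre))
    return total

first_pass = [[(1, 1)], [(2, 1), (4, 3), (5, 4)]]   # {p1} and {p2, p3, p4}
final = [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]        # {p1, p2} and {p3, p4}

rss_first = rss(first_pass)
rss_final = rss(final)
print(rss_first, rss_final)
```

The final clustering has the lower RSS, which is consistent with the algorithm reducing the RSS as it iterates.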

Termination conditions for the K-means algorithm


Stop after a given number of iterations
Pros: Shorter execution time
Cons: Low quality of the clustering if the number of
iterations is inadequate

Stop when the clustering does not change between
iterations
Pros: Good quality of clustering (except when initial
clusters are very bad)
Cons: Runtimes may be way too long!

Stop when the RSS falls below some threshold
To avoid unreasonably long runtimes, you may want
to put a cap on the number of iterations
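
The three termination conditions can be combined into a single check that is evaluated once per iteration. The sketch below is only one way to arrange them; the function name, parameter names and the order of the checks are assumptions.

```python
def stop_reason(iteration, max_iters, old_assignments, new_assignments,
                rss=None, rss_threshold=None):
    """Return a description of why to stop, or None to keep iterating."""
    if new_assignments == old_assignments:
        return "no change in clustering"
    if rss is not None and rss_threshold is not None and rss < rss_threshold:
        return "RSS below threshold"
    if iteration >= max_iters:
        return "iteration cap reached"
    return None

print(stop_reason(3, 100, [0, 0, 1, 1], [0, 0, 1, 1]))
print(stop_reason(100, 100, [0, 1, 1, 1], [0, 0, 1, 1]))
```

Keeping the iteration cap alongside the other two conditions bounds the runtime even when the clustering keeps changing.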

How to determine better initial clusters?
As we have seen, the k-means algorithm is very
sensitive to the initial clustering assignment
Spread the K initial cluster centres as far apart as
you can

NOTE: In practice, the algorithm is usually very
fast, hence it is usual to execute it multiple
times with different initial clustering assignments
and also with different values of K to see which
gives the most desirable clustering result
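
One possible way to spread the initial centres apart is a farthest-first heuristic: start from one point, then repeatedly pick the point whose nearest already-chosen centre is farthest away. The slides do not prescribe a specific method, so the sketch below is an assumption, as are the function name and the choice of the first point as the starting centre (it could equally be chosen at random). It assumes K is no larger than the number of distinct points.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first_centres(points, k):
    """Pick k initial centres that are spread far apart."""
    centres = [points[0]]  # assumption: start from the first point
    while len(centres) < k:
        # Choose the point whose closest chosen centre is farthest away.
        next_centre = max(
            points,
            key=lambda p: min(euclidean(p, c) for c in centres),
        )
        centres.append(next_centre)
    return centres

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
initial_centres = farthest_first_centres(points, 2)
print(initial_centres)
```

For the four example points this selects p1 and p4, the two points that are farthest apart, instead of the adjacent pair p1 and p2.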