

Spring 2007, SJSU
Benjamin Lam
Definition of Clustering
Existing clustering methods
Clustering examples
Classification examples
Clustering can be considered the most important
unsupervised learning technique; like every
other problem of this kind, it deals with finding
structure in a collection of unlabeled data.

Clustering is the process of organizing objects
into groups whose members are similar in some way.

A cluster is therefore a collection of objects that
are similar to one another and dissimilar
to the objects belonging to other clusters.

Mu-Yu Lu, SJSU

Why clustering?
A few good reasons ...

Pattern detection
Useful in data concept construction
Unsupervised learning process
Where to use clustering?
Data mining
Information retrieval
Text mining
Web analysis
Medical diagnosis
Which method should I use?
Type of attributes in data
Scalability to larger dataset
Ability to work with irregular data
Time cost
Data order dependency
Result presentation
Major existing clustering methods
Measuring Similarity
Dissimilarity/Similarity metric: Similarity is
expressed in terms of a distance function, which is
typically metric: d(i, j)
There is a separate quality function that
measures the goodness of a cluster.
The definitions of distance functions are usually
very different for interval-scaled, boolean,
categorical, ordinal and ratio variables.
Weights should be associated with different
variables based on applications and data.
It is hard to define "similar enough" or "good enough";
the answer is typically highly subjective.
Professor Lee, Sin-Min
Distance based method

In this case we easily identify the 4 clusters into which the data
can be divided; the similarity criterion is distance: two or more
objects belong to the same cluster if they are close according to
a given distance. This is called distance-based clustering.
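As a minimal sketch of what "close according to a given distance" can mean, the block below implements the two distance measures that the K-means slides name later (Euclidean and Manhattan/City-Block). The helper `same_cluster` and its `threshold` parameter are hypothetical, added only to illustrate the closeness test; the slides do not define a threshold.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def same_cluster(p, q, threshold, distance=euclidean):
    """Hypothetical helper: treat two objects as close when d(p, q) <= threshold."""
    return distance(p, q) <= threshold
```

Swapping the `distance` argument changes which pairs count as close, which is why the choice of d(i, j) matters for the resulting clusters.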
Hierarchical clustering
Agglomerative (bottom up)
1. Start with 1 point (singleton)
2. Recursively merge two or more appropriate clusters
3. Stop when k number of clusters is achieved.
Divisive (top down)
1. Start with a big cluster
2. Recursively divide into smaller clusters
3. Stop when k number of clusters is achieved.
General steps of hierarchical clustering
Given a set of N items to be clustered, and an N*N distance
(or similarity) matrix, the basic process of hierarchical
clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to its own cluster, so that if you
have N items, you now have N clusters, each containing
just one item. Let the distances (similarities) between
the clusters be the same as the distances (similarities)
between the items they contain.
2. Find the closest (most similar) pair of clusters and merge
them into a single cluster, so that now you have one
cluster fewer.
3. Compute distances (similarities) between the new
cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into K
clusters.
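The general steps above can be sketched as follows. This is an illustrative outline, not a reference implementation: the function name `agglomerate` is hypothetical, and the distance between two clusters is assumed to be the smallest item-to-item distance (the single-link rule), which is only one of several possible choices.

```python
def agglomerate(dist, k):
    """Merge clusters until k remain; dist is a symmetric N x N list of lists."""
    if not 1 <= k <= len(dist):
        raise ValueError("k must be between 1 and the number of items")
    clusters = [[i] for i in range(len(dist))]  # one cluster per item

    def cluster_distance(a, b):
        # Assumption: single-link, i.e. the closest pair of members.
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > k:
        # Find the closest pair of clusters.
        pairs = [(x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        x, y = min(pairs, key=lambda p: cluster_distance(clusters[p[0]],
                                                         clusters[p[1]]))
        # Merge them; distances to the new cluster are recomputed on demand.
        merged = clusters[x] + clusters[y]
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)]
        clusters.append(merged)
    return clusters
```

Recomputing cluster distances from the item matrix on every pass keeps the sketch short; updating a cluster-level matrix after each merge avoids the repeated work.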


Exclusive vs. non-exclusive clustering

In the first case data are grouped in an
exclusive way, so that if a certain
datum belongs to a definite cluster
then it cannot be included in
another cluster. A simple example of
that is shown in the figure below,
where the separation of points is
achieved by a straight line on a
two-dimensional plane.
In contrast, the second type, the
overlapping clustering, uses fuzzy sets
to cluster data, so that each point may
belong to two or more clusters with
different degrees of membership.
Partitioning clustering
1. Divide data into proper subsets
2. Recursively go through each subset
and relocate points between
clusters (as opposed to the visit-once
approach of hierarchical clustering)

This recursive relocation yields higher-quality clusters

Probabilistic clustering
1. Data are picked from a mixture of
probability distributions.
2. Use the mean and variance of each
distribution as parameters for the cluster.
3. Single cluster membership
The N*N proximity matrix is D = [d(i,j)]
The clusterings are assigned
sequence numbers 0, 1, ..., (n-1)
L(k) is the level of the kth clustering
A cluster with sequence number m is
denoted (m)
The proximity between clusters (r)
and (s) is denoted d[(r),(s)]
The algorithm is composed of
the following steps:
Begin with the disjoint clustering having
level L(0) = 0 and sequence number m = 0.
Find the least dissimilar pair of clusters in
the current clustering, say pair (r), (s),
according to
d[(r),(s)] = min d[(i),(j)]
where the minimum is over all pairs of
clusters in the current clustering.
The algorithm is composed of the
following steps (cont.):
Increment the sequence number: m = m + 1.
Merge clusters (r) and (s) into a single cluster to
form the next clustering m. Set the level of this
clustering to
L(m) = d[(r),(s)]
Update the proximity matrix, D, by deleting the
rows and columns corresponding to clusters (r)
and (s) and adding a row and column
corresponding to the newly formed cluster. The
proximity between the new cluster, denoted (r,s),
and old cluster (k) is defined in this way:
d[(k), (r,s)] = min { d[(k),(r)], d[(k),(s)] }
If all objects are in one cluster, stop. Otherwise, go back to
finding the least dissimilar pair of clusters.
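One possible way to write these steps down is sketched below. It keeps the proximities in a dictionary keyed by cluster-label pairs instead of a literal matrix, which is a simplification of the row-and-column update described above. The function name `single_link` and the "r/s" naming of merged clusters are assumptions made for illustration, and labels are assumed to be unique strings.

```python
def single_link(labels, D):
    """Return the merge history as (m, merged cluster, L(m)) tuples."""
    # Proximities keyed by unordered pairs of cluster labels.
    prox = {frozenset((labels[i], labels[j])): D[i][j]
            for i in range(len(labels)) for j in range(i + 1, len(labels))}
    current = list(labels)
    history = []
    m = 0  # disjoint clustering: L(0) = 0, m = 0
    while len(current) > 1:
        pair = min(prox, key=prox.get)  # least dissimilar pair (r), (s)
        r, s = sorted(pair)
        level = prox[pair]              # L(m) = d[(r),(s)]
        m += 1
        new = r + "/" + s
        others = [k for k in current if k not in (r, s)]
        for k in others:
            # d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)])
            d_kr = prox.pop(frozenset((k, r)))
            d_ks = prox.pop(frozenset((k, s)))
            prox[frozenset((k, new))] = min(d_kr, d_ks)
        del prox[pair]
        current = others + [new]
        history.append((m, new, level))
    return history
```

For a made-up 4-item matrix, `single_link(["A", "B", "C", "D"], [[0, 2, 6, 10], [2, 0, 5, 9], [6, 5, 0, 4], [10, 9, 4, 0]])` merges A with B at level 2, C with D at level 4, and the two pairs at level 5.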

Hierarchical clustering example
Let's now see a simple example: a hierarchical
clustering of distances in kilometers between
some Italian cities. The method used is single-linkage.
Input distance matrix (L = 0 for all the clusters):
The nearest pair of cities is MI and TO, at distance 138. These
are merged into a single cluster called "MI/TO". The level of
the new cluster is L(MI/TO) = 138 and the new sequence
number is m = 1.
Then we compute the distance from this new compound object
to all other objects. In single link clustering the rule is that
the distance from the compound object to another object is
equal to the shortest distance from any member of the
cluster to the outside object. So the distance from "MI/TO"
to RM is chosen to be 564, which is the distance from MI to
RM, and so on.
After merging MI with TO we obtain the
following matrix:
min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a
new cluster called NA/RM
L(NA/RM) = 219
min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM
into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM
and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
Finally, we merge the last two clusters at level 295.
The process is summarized by the following hierarchical tree:
K-means algorithm
1. It accepts the number of clusters to group
data into, and the dataset to cluster, as input.
2. It then creates the first K initial clusters (K =
number of clusters needed) from the dataset by
choosing K rows of data randomly from the
dataset. For example, if there are 10,000 rows
of data in the dataset and 3 clusters need to be
formed, then the first K = 3 initial clusters will
be created by selecting 3 records randomly
from the dataset as the initial clusters. Each of
the 3 initial clusters formed will have just one
row of data.
3. The K-Means algorithm calculates the arithmetic
mean of each cluster formed in the dataset. The
arithmetic mean of a cluster is the mean of all the
individual records in the cluster. In each of the first K initial
clusters, there is only one record. The arithmetic mean of a
cluster with one record is the set of values that make up
that record. For example, if the dataset we are discussing
is a set of Height, Weight and Age measurements for
students in a university, where a record P in the dataset S
is represented by a Height, Weight and Age
measurement, then P = {Age, Height, Weight}. Then
a record containing the measurements of a student John
would be represented as John = {20, 170, 80}, where
John's Age = 20 years, Height = 170 cm (1.70 metres) and Weight =
80 pounds. Since there is only one record in each
initial cluster, the arithmetic mean of a cluster with
only the record for John as a member = {20, 170, 80}.
4. Next, K-Means assigns each record in the dataset to only one of the
initial clusters. Each record is assigned to the nearest cluster (the
cluster which it is most similar to) using a measure of distance or
similarity like the Euclidean distance measure or the
Manhattan/City-Block distance measure.
5. K-Means re-assigns each record in the dataset to the most
similar cluster and re-calculates the arithmetic mean of all the clusters
in the dataset. The arithmetic mean of a cluster is the arithmetic mean
of all the records in that cluster. For example, if a cluster contains two
records, where the sets of measurements are John = {20,
170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean
is represented as Pmean = {Agemean, Heightmean, Weightmean}.
Agemean = (20 + 30)/2, Heightmean = (170 + 160)/2 and Weightmean =
(80 + 120)/2. The arithmetic mean of this cluster = {25, 165,
100}. This new arithmetic mean becomes the center of this new
cluster. Following the same procedure, new cluster centers are
formed for all the existing clusters.
6. K-Means re-assigns each record in the dataset to only one of
the new clusters formed. A record or data point is assigned to
the nearest cluster (the cluster which it is most similar to)
using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are
formed and the K-Means clustering procedure is completed.
Stable clusters are formed when new iterations or repetitions
of the K-Means clustering algorithm do not create new
clusters, as the cluster center or arithmetic mean of each
cluster formed is the same as the old cluster center. There
are different techniques for determining when a stable
cluster is formed or when the k-means clustering algorithm
procedure is completed.
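The seven steps can be condensed into a short sketch. Several details here are assumptions rather than part of the slides: Euclidean distance is used for "nearest", the stopping rule is "centers did not change" (one of the different techniques mentioned above), an empty cluster keeps its old center, and the `seed` and `max_iter` parameters are hypothetical conveniences.

```python
import math
import random

def mean(records):
    """Arithmetic mean of a non-empty list of equal-length numeric records."""
    n = len(records)
    return tuple(sum(column) / n for column in zip(*records))

def nearest(record, centers):
    """Index of the center closest to record (Euclidean distance assumed)."""
    return min(range(len(centers)),
               key=lambda c: math.dist(record, centers[c]))

def k_means(data, k, seed=0, max_iter=100):
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # K random rows as the initial clusters
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for record in data:  # assign each record to its nearest center
            clusters[nearest(record, centers)].append(record)
        # Recompute each center; an empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # centers unchanged: clusters are stable
            break
        centers = new_centers
    return centers, clusters
```

With the records from step 5, `mean([(20, 170, 80), (30, 160, 120)])` gives `(25.0, 165.0, 100.0)`, matching the hand calculation.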
Goal: Provide an overview of the
classification problem and introduce some of
the basic algorithms
Classification Problem Overview
Classification Techniques
Decision Trees
Neural Networks
Classification Examples
Teachers classify students' grades
as A, B, C, D, or F.
Identify mushrooms as poisonous
or edible.
Predict when a river will flood.
Identify individuals with credit
Speech recognition
Pattern recognition
Classification Ex: Grading
If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.
Decision tree (figure): x >= 90: A; otherwise x >= 80: B; otherwise x >= 70: C; otherwise x >= 60: D; otherwise F.
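The grading rules can be written as a chain of threshold tests, which mirrors walking the decision tree from the root. In this sketch the function name `grade` is hypothetical, and every score below 60 is treated as F on the assumption that the grade bands are meant to cover all scores.

```python
def grade(x):
    """Classify a numeric score into a letter grade using fixed thresholds."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"  # assumption: everything below 60 is an F
```

Because the classes and their boundaries are fixed in advance, no training data is needed here; the model is the set of thresholds.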
Classification Techniques
1. Create specific model by
evaluating training data (or
using domain experts).
2. Apply model developed to new data.
Classes must be predefined
Most common techniques use
DTs, NNs, or are based on
Defining Classes


Classification Using Regression
Division: Use regression function
to divide area into regions.
Prediction: Use regression
function to predict a class
membership function. Input
includes desired class.
Height Example Data
Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium
Classification Using Distance
Place items in class to which they
are closest.
Must determine distance between
an item and a class.
Classes represented by
Centroid: Central value.
Medoid: Representative point.
Individual points
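As an illustration of the centroid option, the sketch below represents each class by the mean height of its members and places a new item in the class with the closest centroid. The heights and Output1 labels are copied from the Height Example Data table; the function names and the choice of absolute difference as the distance are assumptions made for this example.

```python
# Heights (metres) and Output1 labels from the Height Example Data table.
TRAINING = [
    (1.6, "Short"), (2.0, "Tall"), (1.9, "Medium"), (1.88, "Medium"),
    (1.7, "Short"), (1.85, "Medium"), (1.6, "Short"), (1.7, "Short"),
    (2.2, "Tall"), (2.1, "Tall"), (1.8, "Medium"), (1.95, "Medium"),
    (1.9, "Medium"), (1.8, "Medium"), (1.75, "Medium"),
]

def centroids(training):
    """Mean height of each class, used as that class's centroid."""
    by_class = {}
    for height, label in training:
        by_class.setdefault(label, []).append(height)
    return {label: sum(hs) / len(hs) for label, hs in by_class.items()}

def classify(height, class_centroids):
    """Place the item in the class whose centroid is closest."""
    return min(class_centroids,
               key=lambda label: abs(height - class_centroids[label]))
```

Using a medoid instead would mean picking an actual training height as each class representative; the `classify` step would stay the same.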
Algorithm: KNN
K Nearest Neighbor
Training set includes classes.
Examine K items near item to be classified.
New item placed in class with the
largest number of close items.
O(q) for each tuple to be classified.
(Here q is the size of the training set.)
KNN Algorithm
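A minimal sketch of K nearest neighbors follows, assuming numeric feature vectors and Euclidean distance. The function name `knn_classify` and the layout of the training set as (vector, label) pairs are assumptions; the slide's own pseudocode is not reproduced in the text.

```python
import heapq
import math
from collections import Counter

def knn_classify(training, item, k):
    """Return the majority label among the k training items nearest to item.

    training is a list of (vector, label) pairs. One distance is computed
    per training item, which is the O(q) scan per classified tuple.
    """
    nearest = heapq.nsmallest(k, training,
                              key=lambda pair: math.dist(pair[0], item))
    votes = Counter(label for _, label in nearest)
    # On a tie, Counter returns the label that was counted first.
    return votes.most_common(1)[0][0]
```

An odd K is a common way to reduce ties when there are two classes.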
Classification Using Decision Trees
Partitioning based: Divide
search space into rectangular regions.
Tuple placed into class based on
the region within which it falls.
DT approaches differ in how the
tree is built: DT Induction
Internal nodes associated with
attribute and arcs with values for
that attribute.
Decision Tree
D = {t1, ..., tn} where ti = <ti1, ..., tih>
Database schema contains {A1, A2, ..., Ah}
Classes C = {C1, ..., Cm}
Decision or Classification Tree is a tree
associated with D such that
Each internal node is labeled with
attribute, Ai
Each arc is labeled with a predicate which
can be applied to the attribute at the parent
Each leaf node is labeled with a class, Cj
DT Induction
Comparing DTs

Creates tree using information theory concepts and
tries to reduce the expected number of comparisons.
ID3 chooses the split attribute with the highest
information gain, using entropy as the basis for the calculation.
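The split criterion can be sketched as below, using the usual definitions: entropy is the sum of p * log2(1/p) over the class proportions p, and information gain is the entropy before a split minus the size-weighted entropy of the resulting groups. The representation of tuples as dictionaries and the function names are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a non-empty list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy before the split minus weighted entropy after it."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_split(rows, labels, attributes):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

An attribute that separates the classes perfectly has gain equal to the starting entropy, while one that leaves every group as mixed as before has gain 0.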
Very useful in data mining
Applicable to both text and
graphical data
Helps simplify data complexity
Detects hidden patterns in data
Dr. M.H. Dunham -
Dr. Lee, Sin-Min San Jose State University
Mu-Yu Lu, SJSU
Database System Concepts, Silberschatz, Korth,