Cluster Analysis
• Goals of Cluster Analysis
• Distance Function
• K-Means Clustering
• Agglomerative Clustering
• Divisive Clustering
Distance Function
• A distance function is a formula applied to two data
points that returns a non-negative real number (≥ 0).
• Remember that a data point may be
▫ a number (e.g. 2),
▫ a coordinate (e.g. (3, 4)), or
▫ a record of values (e.g. (20 years, 1.75 m, “Male”, 72 kg,
3.9999 GPA)).
• But to calculate distance, we convert all non-
numerical values into numerical values (see Coding
in Chapter 15).
▫ (20, 1.75, 1, 72, 3.9999), which we call a “vector”.
• So, not surprisingly, when we think of a data point $p$
in clustering, it is also a vector $(x_1, x_2, \ldots, x_k)$.
Distance Function
• Suppose we have 2 data points $p_1 = (x_1, x_2, \ldots, x_k)$ and
$p_2 = (y_1, y_2, \ldots, y_k)$.
• Three kinds of distance functions are commonly used.
• Euclidean distance:
▫ $\mathrm{dist}(p_1, p_2) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_k - y_k)^2}$
More “natural”, like how we usually measure distance.
But it takes more CPU time; sometimes we skip the
square root (using the squared distance) to save a bit of calculation.
• Manhattan distance:
▫ $\mathrm{dist}(p_1, p_2) = |x_1 - y_1| + \cdots + |x_k - y_k|$
Fast to calculate; preserves the practical sense of “further means
larger distance”.
• Max distance:
▫ $\mathrm{dist}(p_1, p_2) = \max(|x_1 - y_1|, \ldots, |x_k - y_k|)$
Also fast to calculate.
A little hard to interpret from a layman’s perspective.
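As a quick illustration, here is a minimal Python sketch of the three distance functions; the function names and the second data point are our own, not from the slides:

import math

def euclidean(p1, p2):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p1, p2)))

def manhattan(p1, p2):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(p1, p2))

def max_distance(p1, p2):
    # Largest absolute coordinate difference
    return max(abs(x - y) for x, y in zip(p1, p2))

# Example: the coded record from earlier vs. a second (made-up) record
p1 = (20, 1.75, 1, 72, 3.9999)
p2 = (25, 1.60, 0, 80, 3.5)
print(euclidean(p1, p2), manhattan(p1, p2), max_distance(p1, p2))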
Clustering Methods
• All clustering methods are programmable steps that
use a distance function to assign a set of data
points to cluster numbers (1, 2, 3, …).
▫ i.e. all clustering methods need:
Data points
A distance function
Any other input needed by the clustering method
• We will look at 3 clustering methods:
▫ K-Means Clustering
▫ Agglomerative Clustering
▫ Divisive Clustering
K-Means Clustering
• If we have a pre-determined number of clusters $k$
(e.g. 2 clusters) in mind, then K-Means clustering
can be used.
• This is not as restrictive as it sounds, since
decision making through clustering typically does
not involve a large number of clusters.
▫ We could also analyze clustering results for
$k$ = 2, 3, and 4 to compare and contrast.
• K-Means clustering also requires $k$ starting cluster
centers as input; different starting cluster centers
may result in different clustering outcomes.
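A minimal sketch of the K-Means loop in Python, assuming squared Euclidean distance; the 1-D data points and the two starting centers below are illustrative, not from the slides:

def k_means(points, centers, n_iter=10):
    # Alternate between (a) assigning each point to its nearest
    # center and (b) recomputing each center as its cluster's mean.
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster is empty
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl
                   else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

# Illustrative 1-D data with k = 2 starting centers at x = 1 and x = 10
centers, clusters = k_means([(2,), (6,), (9,), (3,), (5,), (7,)],
                            [(1,), (10,)])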
Agglomerative Clustering
• The great idea:
▫ i. Start by letting each data point be its own cluster of 1.
▫ ii. Calculate the average distance between all pairs of clusters.
▫ iii. Merge the pair of clusters with the shortest distance.
▫ iv. Then repeat from step (ii) until we get only one big cluster.
• Because we keep a history of the merging steps, it is
possible to draw the dendrogram with distance
information:
▫ We can “cut” the tree (dendrogram) to get any number of
clusters we want.
▫ If the distance before a merger is relatively large compared to
other mergers, it is a good sign that the two clusters should stay
separate and not be merged.
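Here is a short Python sketch of the steps above, using average linkage (the average of all pairwise distances between two clusters); the 1-D data points are illustrative:

def avg_link(c1, c2, dist):
    # Step (ii): average distance over all cross-cluster pairs
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(points, dist):
    clusters = [[p] for p in points]   # (i) each point is its own cluster
    history = []                       # merge history for the dendrogram
    while len(clusters) > 1:
        d, i, j = min((avg_link(clusters[i], clusters[j], dist), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        # (iii) merge the closest pair, recording the merge distance
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)        # (iv) repeat until one cluster remains
    return history

history = agglomerate([2, 3, 5, 6, 7, 9], dist=lambda a, b: abs(a - b))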
Worked example (average distances between the clusters {2, 3}, {9} and {5, 6, 7}):
▫ dist({9}, {2, 3}) = (dist(9,2) + dist(9,3))/2 = (7 + 6)/2 = 6.5
▫ dist({5, 6, 7}, {9}) = (dist(6,9) + dist(7,9) + dist(5,9))/3 = (3 + 2 + 4)/3 = 3
▫ dist({5, 6, 7}, {2, 3}) = (dist(6,2) + dist(7,2) + dist(5,2) + dist(6,3) + dist(7,3) + dist(5,3))/6 = (4 + 5 + 3 + 3 + 4 + 2)/6 = 3.5
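These averages can be checked in a few lines of Python, treating the points as 1-D values with absolute difference as the distance (which matches the pairwise distances in the example):

def avg_link(c1, c2):
    # Average of all pairwise absolute differences between two clusters
    return sum(abs(a - b) for a in c1 for b in c2) / (len(c1) * len(c2))

print(avg_link([9], [2, 3]))        # (7 + 6) / 2 = 6.5
print(avg_link([6, 7, 5], [9]))     # (3 + 2 + 4) / 3 = 3.0
print(avg_link([6, 7, 5], [2, 3]))  # (4 + 5 + 3 + 3 + 4 + 2) / 6 = 3.5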
Divisive Clustering
• The great idea:
▫ i. Start by letting all data points be one big cluster.
▫ ii. Split a cluster into 2 new smaller clusters, seeded
with the two furthest data points.
▫ iii. Then calculate the average distance of every other remaining
data point to the two new clusters, moving each data
point to the closer new cluster.
▫ iv. Then repeat from step (ii) until we get only one data point
per cluster.
• It is also possible to draw the dendrogram with distance
information:
▫ We can “cut” the tree (dendrogram) to get any number of
clusters we want.
▫ If the distance before a merger is relatively large compared
to other mergers, it is a good sign that the clusters should stay
separate and not be merged.
▫ The dendrogram may not be the same as the one from
Agglomerative Clustering.
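A rough Python sketch of a single divisive split as described in steps (ii) and (iii); the seed-selection and reassignment details follow our reading of the slide, and the 1-D data are illustrative:

def split(cluster, dist):
    # (ii) seed the two new clusters with the two furthest points
    s1, s2 = max(((a, b) for a in cluster for b in cluster),
                 key=lambda pair: dist(*pair))
    c1, c2 = [s1], [s2]
    # (iii) move every remaining point to the closer new cluster,
    # judged by its average distance to each cluster's members
    for p in cluster:
        if p in (s1, s2):
            continue
        d1 = sum(dist(p, q) for q in c1) / len(c1)
        d2 = sum(dist(p, q) for q in c2) / len(c2)
        (c1 if d1 <= d2 else c2).append(p)
    return c1, c2

halves = split([2, 3, 5, 6, 7, 9], dist=lambda a, b: abs(a - b))
# Repeat the split on each half until every cluster holds one point.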
Example assignment table (d1 and d2 are squared distances from each point to two cluster centers; each point’s newCluster is apparently the nearer one; the d1 and d2 entries of the last row are truncated in the source):

  name  x  curCluster  newCluster  d1  d2
1    a  2           0           1   1  64
2    b  6           0           2  25  16
3    c  9           0           2  64   1
4    d  3           0           1   4  49
5    e  5           0           1  16  25
6    f  7           0           2   …   …
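The d1 and d2 values are consistent with squared distances to two centers at x = 1 and x = 10; this is our inference from the numbers, not stated in the source. A minimal Python check:

c1, c2 = 1, 10                          # hypothetical centers inferred from d1, d2
points = {"a": 2, "b": 6, "c": 9, "d": 3, "e": 5, "f": 7}
for name, x in points.items():
    d1, d2 = (x - c1) ** 2, (x - c2) ** 2
    new_cluster = 1 if d1 < d2 else 2   # join the nearer center
    print(name, x, d1, d2, new_cluster)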
[Dendrogram fragment: cluster C3, height = 4]