
AB1202

Statistics and Analysis


Lecture 13
Cluster Analysis

Chin Chee Kai


cheekai@ntu.edu.sg
Nanyang Business School
Nanyang Technological University

Cluster Analysis
• Goals of Cluster Analysis
• Distance Function
• K-Means Clustering
• Agglomerative Clustering
• Divisive Clustering

Goals of Cluster Analysis


• Identify “similar” items in the same cluster
▫ Eg, low-quality bank loan applicants, big-spending customers, groups of suspicious accounts, etc
• Detect structures in terms of clusters
▫ Eg, a retail store may not know it is serving two distinct
groups of customers until it clusters its members’
transactions.
• Filtering dissimilar data
▫ Eg, applying multiple regression on two distinct clusters of
data will produce meaningless results; instead, analyse each
cluster separately.
• Noise removal
▫ Cluster centers can be summaries of clusters, identifying
core characteristics, and allowing outliers and noise to be
ignored.

Distance Function
• A distance function is a formula applied to two data points that gives a non-negative decimal number (≥ 0).
• Remember that a data point may be
▫ a number (eg 2),
▫ a coordinate (eg (3, 4)), or
▫ a record of values (eg (20 years, 1.75 m, “Male”, 72 kg,
3.9999 GPA)).
• But to calculate distance, we convert all non-
numerical values into numerical values (see Coding
in Chapter 15).
▫ (20, 1.75, 1, 72, 3.9999) → we call this a “vector”
• So, not surprisingly, when we think of a data point p in clustering, it is also a vector (x1, x2, …, xk).
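• As a small illustration (not from the lecture; the variable names are made up), this coding step might look like the following in R:

# Hypothetical record from the slide: (20 years, 1.75 m, "Male", 72 kg, 3.9999 GPA)
record <- list(age = 20, height = 1.75, gender = "Male", weight = 72, gpa = 3.9999)

# Code the categorical value numerically, eg "Male" = 1, "Female" = 0
gender_code <- ifelse(record$gender == "Male", 1, 0)

# The resulting vector, as on the slide: (20, 1.75, 1, 72, 3.9999)
p <- c(record$age, record$height, gender_code, record$weight, record$gpa)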

Distance Function
• Suppose we have 2 data points p1 = (x1, x2, …, xk) and p2 = (y1, y2, …, yk).
• 3 kinds of distance functions are commonly used.
• Euclidean distance:
▫ dist(p1, p2) = √((x1 − y1)² + ⋯ + (xk − yk)²)
– More “natural”, like how we usually measure distance.
– But takes more CPU time; sometimes we skip the square root to save a bit of calculation.
• Manhattan distance:
▫ dist(p1, p2) = |x1 − y1| + ⋯ + |xk − yk|
– Fast to calculate; preserves the practical sense of “further means larger distance”.
• Max distance:
▫ dist(p1, p2) = max(|x1 − y1|, …, |xk − yk|)
– Also fast to calculate.
– A little hard to interpret from a layman’s perspective.
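• A short R sketch (not from the lecture) of the three distance functions, using base R’s dist() with its “euclidean”, “manhattan” and “maximum” methods; the second point’s values are made up for illustration:

p1 <- c(20, 1.75, 1, 72, 3.9999)    # the coded record from the previous slide
p2 <- c(35, 1.60, 0, 80, 3.2)       # a second, hypothetical record
pts <- rbind(p1, p2)

dist(pts, method = "euclidean")     # sqrt((x1-y1)^2 + ... + (xk-yk)^2)
dist(pts, method = "manhattan")     # |x1-y1| + ... + |xk-yk|
dist(pts, method = "maximum")       # max(|x1-y1|, ..., |xk-yk|)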

Clustering Methods
• All clustering methods are programmable procedures that use a distance function to assign a set of data points to cluster numbers (1, 2, 3, …).
▫ ie all clustering methods need:
 Data points
 A distance function
 Any other input needed by the clustering method
• We will look at 3 clustering methods:
▫ K-Means Clustering
▫ Agglomerative Clustering
▫ Divisive Clustering

K-Means Clustering
• If we have a pre-determined number of clusters k (eg 2 clusters) in mind, then K-Means clustering can be used.
• This is not as restrictive as it sounds, since decision making through clustering typically does not involve a large number of clusters.
▫ Or we could also try to analyze clustering results
from 𝑘 = 2, 3 and 4 to compare and contrast.
• K-Means clustering also requires 𝑘 starting cluster
centers as input – different starting cluster centers
may result in different clustering outcomes.

K-Means Clustering Example


• Suppose we have a data set 2, 6, 9, 3, 5, 7. If we
impose 2 clusters on this data, which point belongs to
which cluster?
• Using K-Means with starting centers 1 and 10, we calculate squared Euclidean distances to center 1 (d1) and center 2 (d2) for every point.
[1] "Step 1: -------------"
  name x curCluster newCluster d1 d2
1    a 2          0          1  1 64
2    b 6          0          2 25 16
3    c 9          0          2 64  1
4    d 3          0          1  4 49
5    e 5          0          1 16 25
6    f 7          0          2 36  9
[1] " Centers ====="
[1] 3.333333 7.333333

• The initial cluster centers are 1 and 10; the distances are tabulated above. Each point is assigned to the cluster whose center is closest to it, as indicated by the “newCluster” column.
• For the new clusters, new center points are calculated by averaging the data points:
▫ Cluster 1 new center = (2 + 3 + 5)/3 = 3.3333
▫ Cluster 2 new center = (6 + 9 + 7)/3 = 7.3333
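• A minimal base-R sketch (not the lecture’s own code) of this first step: squared Euclidean distances to the starting centers, reassignment, then averaging to get the new centers.

x <- c(2, 6, 9, 3, 5, 7)
centers <- c(1, 10)                      # starting cluster centers

d1 <- (x - centers[1])^2                 # squared distance to center 1
d2 <- (x - centers[2])^2                 # squared distance to center 2
newCluster <- ifelse(d1 <= d2, 1, 2)     # assign each point to the nearer center

tapply(x, newCluster, mean)              # new centers: 3.333333 and 7.333333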

K-Means Clustering Example


• Again, we calculate squared Euclidean distances from all data points to the new cluster centers (3.3333 and 7.3333) to get updated columns d1 and d2.
• Each data point is re-assigned to the new cluster closest to it.
• But we see no change in the assignments, and so K-Means stops.
[1] "Step 2: -------------"
name x curCluster newCluster d1 d2
1 a 2 1 1 1.7777778 28.4444444
2 b 6 2 2 7.1111111 1.7777778
3 c 9 2 2 32.1111111 2.7777778
4 d 3 1 1 0.1111111 18.7777778
5 e 5 1 1 2.7777778 5.4444444
6 f 7 2 2 13.4444444 0.1111111
[1] " Centers ====="
[1] 3.333333 7.333333
• The “newCluster” column above gives the final clustering assignments; the final cluster centers are 3.3333 and 7.3333.
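• For reference, a hedged sketch of how the same result could be obtained with base R’s kmeans(), passing the starting centers explicitly (the “Lloyd” algorithm follows the reassign-and-average procedure shown above):

x  <- c(2, 6, 9, 3, 5, 7)
km <- kmeans(x, centers = c(1, 10), algorithm = "Lloyd")
km$cluster    # final clustering assignment of each point
km$centers    # final cluster centers (3.3333 and 7.3333)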

Agglomerative and Divisive Clustering


• Unlike K-Means, these methods do not require the number of clusters to be decided in advance. They are, therefore, great for detecting inherent clusters in a vast pool of data.
• Also unlike K-Means, they do not require guessing initial cluster centers, since they derive the clustering from the data values themselves.
• They generate a hierarchical layering of data points that shows which points are clustered with which.
▫ Thus, these methods are also called “Hierarchical Clustering”.
▫ Drawing a tree diagram of the links results in what is called a “dendrogram” in clustering terminology (just an inverted tree with leaves on the ground).

Agglomerative Clustering
• The great idea:
▫ i. Start by letting each data point be its own cluster of one.
▫ ii. Calculate the average distance between all pairs of clusters.
▫ iii. Merge the pair of clusters with the shortest distance.
▫ iv. Then repeat from step (ii) until we get only one big cluster.
• Because we keep a history of the merging steps, it is possible to draw the dendrogram with distance information:
▫ We can “cut” the tree (dendrogram) to get any number of clusters we want (see the R sketch below).
▫ If the distance before a merger is relatively large compared to the other merger distances, it is a good sign that the two clusters should be kept separate and not merged.
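• A hedged R sketch of these steps using agnes() from the cluster package (its default “average” linkage matches the average-distance rule above), together with cutree() to cut the dendrogram into a chosen number of clusters:

library(cluster)

x   <- c(2, 6, 9, 3, 5, 7)
dis <- dist(x)                   # pairwise Euclidean distances
cst <- agnes(dis)                # agglomerative clustering, average linkage

cst$height                       # merge distances (dendrogram heights)
cst$order                        # order of points along the dendrogram
pltree(cst)                      # draw the dendrogram

cutree(as.hclust(cst), k = 2)    # "cut" the tree into 2 clusters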

Agglomerative Clustering Example


• Let’s cluster again 2, 6, 9, 3, 5, 7.
[1] "Step 1: -------------"
  name x curCluster newCluster 1 2 3 4 5 6
1    a 2          1          7 -
2    b 6          2          2 4 -
3    c 9          3          3 7 3 -
4    d 3          4          7 1 3 6 -
5    e 5          5          5 3 1 4 2 -
6    f 7          6          6 5 1 2 4 2 -

• Clusters 1 and 4 (points 2 and 3, distance 1) are agglomerated into new cluster 7.

[1] "Step 2: -------------"
  name x curCluster newCluster   7 2 3 5 6
1    a 2          7          7
4    d 3          7          7   -
2    b 6          2          8 3.5 -
3    c 9          3          3 6.5 3 -
5    e 5          5          5 2.5 1 4 -
6    f 7          6          8 4.5 1 2 2 -

• Clusters 2 and 6 (points 6 and 7, distance 1) are agglomerated into new cluster 8.
• The distances to the new cluster 7 are averages over its member points:
▫ (dist(6,2)+dist(6,3))/2 = (4+3)/2 = 3.5
▫ (dist(9,2)+dist(9,3))/2 = (7+6)/2 = 6.5
▫ (dist(5,2)+dist(5,3))/2 = (3+2)/2 = 2.5
▫ (dist(7,2)+dist(7,3))/2 = (5+4)/2 = 4.5
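• A small base-R check (not from the slides) of the average distances to the new cluster 7 = {2, 3} used in Step 2:

cluster7 <- c(2, 3)
mean(abs(cluster7 - 6))   # (dist(6,2) + dist(6,3)) / 2 = (4 + 3) / 2 = 3.5
mean(abs(cluster7 - 9))   # (7 + 6) / 2 = 6.5
mean(abs(cluster7 - 5))   # (3 + 2) / 2 = 2.5
mean(abs(cluster7 - 7))   # (5 + 4) / 2 = 4.5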

Agglomerative Clustering Example


• Let’s cluster again 2, 6, 9, 3, 5, 7.

[1] "Step 3: -------------"
  name x curCluster newCluster   7   8 3 5
1    a 2          7          7
4    d 3          7          7   -
2    b 6          8          9
6    f 7          8          9   4   -
3    c 9          3          3 6.5 2.5 -
5    e 5          5          9 2.5 1.5 4 -

• Clusters 8 and 5 are merged into new cluster 9.
• The average distances in this step:
▫ (dist(6,2)+dist(7,2)+dist(6,3)+dist(7,3))/4 = (4+5+3+4)/4 = 4
▫ (dist(9,2)+dist(9,3))/2 = (7+6)/2 = 6.5
▫ (dist(5,2)+dist(5,3))/2 = (3+2)/2 = 2.5
▫ (dist(9,6)+dist(9,7))/2 = (3+2)/2 = 2.5
▫ (dist(5,6)+dist(5,7))/2 = (1+2)/2 = 1.5

[1] "Step 4: -------------"
  name x curCluster newCluster   7 9 3
1    a 2          7          7
4    d 3          7          7   -
2    b 6          9         10
6    f 7          9         10
5    e 5          9         10 3.5 -
3    c 9          3         10 6.5 3 -

• Clusters 9 and 3 are merged into new cluster 10.
• The average distances in this step:
▫ (dist(6,2)+dist(7,2)+dist(5,2)+dist(6,3)+dist(7,3)+dist(5,3))/6 = (4+5+3+3+4+2)/6 = 3.5
▫ (dist(9,2)+dist(9,3))/2 = (7+6)/2 = 6.5
▫ (dist(6,9)+dist(7,9)+dist(5,9))/3 = (3+2+4)/3 = 3

Agglomerative Clustering Example


• Let’s cluster again 2, 6, 9, 3, 5, 7.
[1] "Step 5: -------------"
  name x curCluster newCluster    7 10
1    a 2          7          7
4    d 3          7          7    -
2    b 6         10         10
6    f 7         10         10
5    e 5         10         10
3    c 9         10         10 4.25 -

• Clusters 7 and 10 are merged into new cluster 11. Done!
• The last average distance: (dist(6,2)+dist(7,2)+dist(5,2)+dist(9,2)+dist(6,3)+dist(7,3)+dist(5,3)+dist(9,3))/8 = (4+5+3+7+3+4+2+6)/8 = 4.25
• [Dendrogram: C7 and C8 form at height 1, C9 at height 1.5, C10 at height 3, C11 at height 4.25.]

Call: agnes(x = dis)
Agglomerative coefficient: 0.6666667
Order of objects:
[1] 2 3 6 7 5 9
Height (summary):
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.00    1.00    1.50    2.15    3.00    4.25
> cst$height
[1] 1.00 4.25 1.00 1.50 3.00
> cst$order
[1] 1 4 2 6 5 3

Divisive Clustering
• The great idea:
▫ i. Start by letting all data points be one big cluster.
▫ ii. Split each cluster into 2 new smaller clusters, starting with the two furthest data points.
▫ iii. Then calculate the average distance of each remaining data point to the two new clusters, moving each point to the closer new cluster.
▫ iv. Then repeat step (ii) until we get only one data point per cluster.
• It is also possible to draw the dendrogram with distance information:
▫ We can “cut” the tree (dendrogram) to get any number of clusters we want.
▫ If the distance before a merger is relatively large compared to the other merger distances, it is a good sign that the two clusters should be kept separate and not merged.
▫ It may not be the same dendrogram as the one from Agglomerative clustering (see the R sketch below).
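• A hedged R sketch of divisive clustering with diana() from the cluster package on the same data (this is the call that produces the output shown in the example that follows):

library(cluster)

x   <- c(2, 6, 9, 3, 5, 7)
dis <- dist(x)                   # pairwise Euclidean distances
cst <- diana(dis)                # divisive clustering

cst$height                       # split heights in the dendrogram
cst$order                        # order of points along the dendrogram
pltree(cst)                      # draw the dendrogram

cutree(as.hclust(cst), k = 2)    # cut into 2 clusters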

Divisive Clustering Example


• Let’s cluster yet again 2, 6, 9, 3, 5, 7.
[1] "Step 1: -------------"
  name x curCluster newCluster Pt 2 6 9 3 5 7
1    a 2          1          2  2 -
2    b 6          1          3  6 4 -
3    c 9          1          3  9 7 3 -
4    d 3          1          2  3 1 3 6 -
5    e 5          1          3  5 3 1 4 2 -
6    f 7          1          3  7 5 1 2 4 2 -

• Points 2 and 9 are the starting points for dividing into 2 clusters, 2 and 3.

[1] "Step 2: -------------"
  name x curCluster newCluster Pt 2 3 6 9 5 7
1    a 2          2          4  2 -
4    d 3          2          5  3 1 -
2    b 6          3          6  6 -
3    c 9          3          7  9 3 -
5    e 5          3          6  5 1 4 -
6    f 7          3          6  7 1 2 2 -

• Points 2 and 3 are the starting points for dividing into 2 clusters, 4 and 5 (then stop).
• Points 5 and 9 are the starting points for dividing into 2 clusters, 6 and 7.

Divisive Clustering Example


• Let’s cluster yet again 2, 6, 9, 3, 5, 7.
[1] "Step 3: -------------"
  name x curCluster newCluster Pt 2 3 6 5 7 9
1    a 2          4          4  2 -
4    d 3          5          5  3 -
2    b 6          6          9  6 -
5    e 5          6          8  5 1 -
6    f 7          6          9  7 1 2 -
3    c 9          7          7  9 -

• Points 5 and 7 are the starting points for dividing into 2 clusters, 8 and 9.

[1] "Step 4: -------------"
  name x curCluster newCluster Pt 2 3 6 7 5 9
1    a 2          4          4  2 -
4    d 3          5          5  3 -
2    b 6          9         10  6 -
6    f 7          9         11  7 1 -
5    e 5          8          8  5 -
3    c 9          7          7  9 -

• Points 6 and 7 are the starting points for dividing into 2 clusters, 10 and 11 (and stop).

Divisive Clustering Example


• Let’s cluster yet again 2, 6, 9, 3, 5, 7.
[1] "Step 4: -------------"
  name x curCluster newCluster Pt 2 3 6 7 5 9
1    a 2          4          4  2 -
4    d 3          5          5  3 -
2    b 6         10         10  6 -
6    f 7         11         11  7 -
5    e 5          8          8  5 -
3    c 9          7          7  9 -

• [Dendrogram: C2 and C9 split off at height 1, C6 at height 2, C3 at height 4, and C1 (the full data set) at height 7.]

> cst = diana(dis)
> cst$height
[1] 1 7 1 2 4
> cst$order
[1] 1 4 2 6 5 3