Cluster Analysis (classification analysis, numerical taxonomy):
A class of techniques used to classify objects or cases into relatively
homogeneous groups, called clusters, based on the set of variables
considered. Objects in each cluster tend to be similar to each other
and dissimilar to objects in the other clusters.
Objects: either variables or observations.
Likeness: calculated from the measurements for each object.
Applications:
Model:
Data: each object is characterized by a set of
numbers (measurements);
e.g., object 1: (x11, x12, ..., x1n)
     object 2: (x21, x22, ..., x2n)
     ...
     object p: (xp1, xp2, ..., xpn)
Distance: Euclidean distance, dij:
dij = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + ... + (xin - xjn)^2 )
Example (income measured in units of 10K):

Household   Income   Household Size
A           50K      5
B           50K      4
C           20K      2
D           20K      1

[Scatter plot: income ($, unit: 10K) vs. household size for households A
to D; sample distances: d(A, C) = 4.24, d(B, C) = 3.61]

[Chart: frequency of going to fast food restaurants, scored 1 (Low) to 7
(High), for Respondents A, B, C and D]
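Plugging the example into the distance formula (a minimal Python sketch; income is converted to 10K units first, so A = (5, 5), B = (5, 4), C = (2, 2), D = (2, 1)):

```python
import math

# Household measurements from the example: (income in 10K units, household size)
households = {"A": (5, 5), "B": (5, 4), "C": (2, 2), "D": (2, 1)}

def euclidean(p, q):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Print all pairwise distances
names = sorted(households)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"d({a},{b}) = {euclidean(households[a], households[b]):.2f}")
```

This reproduces the distances on the slide, e.g. d(A, C) = 4.24 and d(B, C) = 3.61.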
Clustering procedures
Hierarchical procedures
Agglomerative (start from n clusters to
get to 1 cluster)
Divisive (start from 1 cluster to get to n
clusters)
Non-hierarchical procedures
K-means clustering
Hierarchical clustering
Agglomerative:
Each of the n observations constitutes a separate cluster
The two clusters that are most similar according to some distance rule are
merged, so that after step 1 there are n-1 clusters
In the second step another merge occurs (n-2 clusters), nesting the two
clusters that are most similar, and so on
There is a merging in each step until all observations end up in a single
cluster in the final step.
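The agglomerative loop above can be sketched in Python (a minimal illustration; cluster-to-cluster distance here is the minimum pairwise distance, i.e. the nearest-neighbour rule covered under linkage methods below):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_pair(clusters):
    """Indices of the two clusters with the smallest minimum pairwise distance."""
    pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
    return min(pairs, key=lambda ij: min(
        dist(p, q) for p in clusters[ij[0]] for q in clusters[ij[1]]))

def agglomerate(points):
    """Merge the two most similar clusters at each step until one remains."""
    clusters = [[p] for p in points]          # start: n singleton clusters
    sizes = []
    while len(clusters) > 1:
        i, j = nearest_pair(clusters)
        merged = clusters[i] + clusters[j]    # one merge per step
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        sizes.append(len(clusters))           # n-1, n-2, ..., 1
    return sizes

# Household example: n = 4 observations shrink to 3, 2, then 1 cluster
print(agglomerate([(5, 5), (5, 4), (2, 2), (2, 1)]))  # -> [3, 2, 1]
```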
Divisive
All observations are initially assumed to belong to a single cluster
The most dissimilar observation(s) are extracted to form a separate cluster
In step 1 there will be 2 clusters, in the second step 3 clusters, and so on,
until the final step produces as many clusters as there are
observations. This technique is used in medical research and is not
within the scope of our course.
Non-hierarchical clustering
These algorithms do not follow a hierarchy and produce a
single partition
Knowledge of the number of clusters (c) is required
In the first step, initial cluster centres (the seeds) are
determined for each of the c clusters, either by the
researcher or by the software.
Each iteration allocates observations to each of the c
clusters, based on their distance from the cluster centres
Cluster centres are computed again and observations may
be reallocated to the nearest cluster in the next iteration
When no observations can be reallocated or a stopping rule
is met, the process stops
Linkage methods
Single linkage method (nearest neighbour):
distance between two clusters is the minimum
distance among all possible distances between
observations belonging to the two clusters.
Complete linkage method (furthest neighbour):
merges two clusters using as a basis the maximum
distance between observations belonging to the
separate clusters.
Average linkage method: the distance between
two clusters is the average of all distances
between observations in the two clusters.
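The three linkage rules can be written directly as functions of the pairwise distances (a minimal sketch, reusing the household example):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single(c1, c2):
    """Nearest neighbour: minimum distance over all pairs across the clusters."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete(c1, c2):
    """Furthest neighbour: maximum distance over all pairs across the clusters."""
    return max(dist(p, q) for p in c1 for q in c2)

def average(c1, c2):
    """Average of all pairwise distances between the two clusters."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

# Two clusters from the household example: {A, B} and {C, D}
ab, cd = [(5, 5), (5, 4)], [(2, 2), (2, 1)]
print(round(single(ab, cd), 2))    # 3.61
print(round(complete(ab, cd), 2))  # 5.0
print(round(average(ab, cd), 2))   # 4.27
```

Note how the three rules give different cluster distances for the same pair of clusters, which is why they can produce different dendrograms.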
Ward algorithm
1. The sum of squared distances is computed
within each of the clusters, considering all
distances between observations within the same
cluster
2. The algorithm proceeds by choosing the
aggregation between two clusters which
generates the smallest increase in the total sum
of squared distances.
It is a computationally intensive method,
because at each step all the sums of squared
distances need to be computed, together with all
potential increases in the total sum of squared
distances for each possible aggregation of
clusters.
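A sketch of the Ward criterion; here the within-cluster sum of squared distances is computed with respect to the cluster centroid, which is one common formulation of Ward's method:

```python
def centroid(cluster):
    """Mean point of a cluster."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def sse(cluster):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in cluster)

def merge_cost(c1, c2):
    """Increase in total SSE caused by merging c1 and c2 (Ward's criterion)."""
    return sse(c1 + c2) - sse(c1) - sse(c2)

# Household example: merging {A, B} with {C, D} would add far more error
# than either cluster carries on its own (0.5 each), so Ward would keep
# them separate as long as a cheaper merge exists.
ab, cd = [(5, 5), (5, 4)], [(2, 2), (2, 1)]
print(sse(ab), sse(cd), merge_cost(ab, cd))  # 0.5 0.5 18.0
```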
Non-hierarchical clustering:
K-means method
1. The number k of clusters is fixed
2. An initial set of k seeds (aggregation centres) is
provided, e.g. the first k elements
3. Given a certain fixed threshold, all units are
assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is
necessary
Units can be reassigned in successive steps
(optimising partitioning)
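The k-means steps above can be sketched as follows (a minimal pure-Python illustration; the first k points serve as the initial seeds, as in step 2):

```python
def kmeans(points, k, max_iter=100):
    """Minimal k-means; the first k points serve as the initial seeds."""
    centres = [list(p) for p in points[:k]]
    for _ in range(max_iter):
        # Step 3: assign every unit to the nearest cluster seed
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centres[i])))
            clusters[j].append(p)
        # Step 4: new seeds = means of the current clusters
        dims = len(points[0])
        new = [[sum(p[d] for p in c) / len(c) for d in range(dims)] if c else centres[i]
               for i, c in enumerate(clusters)]
        if new == centres:   # Step 5: stop when no reclassification occurs
            return clusters
        centres = new
    return clusters

# Household example with k = 2 recovers the clusters {A, B} and {C, D}
print(kmeans([(5, 5), (5, 4), (2, 2), (2, 1)], 2))
```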
Non-hierarchical methods
Outliers:
An outlier will affect your cluster solution if you
don't remove it!
It will also affect your cluster solution if you do
remove it (small sample size)!
Variable  Description                                                          Type

Work Environment Measures
X1        I am paid fairly for the work I do.                                  Metric
X2        I am doing the kind of work I want.                                  Metric
X3        My supervisor gives credit and praise for work well done.            Metric
X4        There is a lot of cooperation among the members of my work group.    Metric
X5        My job allows me to learn new skills.                                Metric
X6        My supervisor recognizes my potential.                               Metric
X7        My work gives me a sense of accomplishment.                          Metric
X8        My immediate work group functions as a team.                         Metric
X9        My pay reflects the effort I put into doing my work.                 Metric
X10       My supervisor is friendly and helpful.                               Metric
X11       The members of my work group have the skills and/or training
          to do their job well.                                                Metric
X12       The benefits I receive are reasonable.                               Metric

Relationship Measures
X13       I have a sense of loyalty to McDonald's restaurant.                  Metric
X14       I am willing to put in a great deal of effort beyond that
          expected to help McDonald's restaurant to be successful.             Metric
X15       I am proud to tell others that I work for McDonald's restaurant.     Metric

Classification Variables
X16       Intention to Search                                                  Metric
X17       Length of Time an Employee                                           Nonmetric
X18       Work Type = Part-Time vs. Full-Time                                  Nonmetric
X19       Gender                                                               Nonmetric
X20       Age                                                                  Metric
X21       Performance                                                          Metric
For this example we are looking for subgroups among all the 63
employees of McDonald's restaurant using the organizational
commitment variables. The SPSS click-through sequence is: Analyze >
Classify > Hierarchical Cluster. This will take you to a dialog box where
you select and move variables X13, X14 and X15 into the Variables box.
Next go to the Statistics box; Agglomeration schedule is selected as the
default option, and Cluster Membership: None is selected as default. We shall
continue with the default options here. Next click on the Plots box, check
Dendrogram, and in the Icicle window click on the None button. Then Continue.
Next click on the Method box and select Ward's under Cluster Method (it
is the last option). Squared Euclidean Distance is the default under
Measure and we will use it, and we do not need to standardize this data.
We will not select anything in the Save option now. Now click OK to
run the program.
Identify the number of clusters from the dendrogram
ANOVA
Move the cluster ID variable into the window.
Click on Options, check Descriptive, next Continue, and then OK.
Conclusion:
Cluster 1 = More Committed
Cluster 2 = Less Committed
Take the 2-cluster ID variable out and insert the 3-cluster ID variable.
Click on the Post Hoc button and check Scheffe.
Conclusions:
Cluster 1 = Least Committed
Cluster 2 = Moderately Committed
Cluster 3 = Most Committed
Individual cluster sample sizes are OK.
The clusters are significantly different, but we
must examine the post hoc tests.
Remove the 3-cluster ID variable and insert the 4-cluster ID variable,
then click OK to run.
Error Reduction:
1 -> 2 clusters = 58.4%
2 -> 3 clusters = 25.5%
3 -> 4 clusters = 22.8%
4 -> 5 clusters = 22.2%
Conclusion: the benefit is similar or less after 3
clusters.
[Plot: error coefficients vs. number of clusters]
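The percentages above come from the relative drop in the agglomeration coefficient between successive merge steps. A minimal sketch; the coefficient values below are hypothetical, chosen only to reproduce the slide's percentages (the real values come from the SPSS agglomeration schedule):

```python
# Hypothetical agglomeration coefficients per number of clusters
# (made-up values chosen to match the percentages on the slide)
coefficients = {1: 240.0, 2: 99.8, 3: 74.4, 4: 57.4, 5: 44.65}

def error_reduction(coefs, k):
    """Percent drop in the coefficient when moving from k to k + 1 clusters."""
    return 100 * (coefs[k] - coefs[k + 1]) / coefs[k]

for k in (1, 2, 3, 4):
    print(f"{k} -> {k + 1} clusters: {error_reduction(coefficients, k):.1f}%")
```

The drop flattens after 3 clusters, which is the elbow reasoning behind the conclusion on the slide.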
Insert demographic variables
Assign value
labels for
clusters
Thank you