
Cluster Analysis

Cluster analysis is a multivariate statistical technique used to classify variables into
groups when the groups are not known in advance. One reason to cluster variables is to
reduce their number. The technique may yield new variables that are more intuitively
understood than those found using principal components. In a nutshell, cluster analysis is a
multivariate statistical classification technique for grouping variables based on
some measure of similarity.

This procedure is an agglomerative hierarchical method that begins with all variables
separate, each forming its own cluster. In the first step, the two variables closest together are
joined. In the next step, either a third variable joins the first two, or two other variables join
together into a different cluster. This process will continue until all clusters are joined into
one, but you must decide how many groups are logical for your data.
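
The joining steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular statistics package: the function name `agglomerate`, the distance matrix `D`, and the choice of single linkage (minimum distance) are all assumptions made for the example.

```python
from itertools import combinations

def agglomerate(dist, linkage=min):
    """Return the list of merges (cluster_a, cluster_b, distance) for items 0..n-1.

    dist    : symmetric matrix (list of lists) of pairwise distances
    linkage : reduces all item-to-item distances between two clusters to one
              number (min = single linkage, max = complete linkage)
    """
    def between(a, b):
        return linkage(dist[i][j] for i in a for j in b)

    # Start with every item in its own cluster.
    clusters = [(i,) for i in range(len(dist))]
    steps = []
    while len(clusters) > 1:
        # Join the two clusters that are currently closest together.
        a, b = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        steps.append((a, b, between(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return steps

# Hypothetical distances between four variables, numbered 0 to 3.
D = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.85],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.85, 0.2, 0.0],
]

steps = agglomerate(D)
for a, b, d in steps:
    print(a, "+", b, "at distance", d)
```

With these made-up distances, variables 0 and 1 join first, then 2 and 3 join into a different cluster, and the two clusters are merged last.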

The final grouping of clusters (also called the final partition) is the grouping of clusters which
will, hopefully, identify groups whose observations or variables share common
characteristics. The decision about final grouping is also called cutting the dendrogram. The
complete dendrogram (tree diagram) is a graphical depiction of the amalgamation of
observations or variables into one cluster. Cutting the dendrogram is akin to drawing a line
across the dendrogram to specify the final grouping.

After choosing where you wish to make your partition, rerun the clustering procedure, using
either Number of clusters or Similarity level to give you either a set number of groups or a
similarity level for cutting the dendrogram. Examine the resulting clusters in the final
partition to see if the grouping seems logical. Looking at dendrograms for different final
groupings can also help you to decide which one makes the most sense for your data.
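
As a rough sketch of what cutting the dendrogram means, the merges can be replayed until a stopping rule is reached. The helper `cut`, its parameters, and the merge list below are hypothetical. The sketch cuts at a distance level rather than a similarity level, which is the same idea read in the opposite direction.

```python
def cut(n_items, steps, n_clusters=None, max_distance=None):
    """Replay merges until a stopping rule is hit and return the final partition.

    steps        : merges as (cluster_a, cluster_b, distance), in joining order
    n_clusters   : stop once this many clusters remain
    max_distance : stop before any merge above this distance level
    """
    clusters = [{i} for i in range(n_items)]
    for a, b, d in steps:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        if max_distance is not None and d > max_distance:
            break
        merged = set(a) | set(b)
        # Drop the two clusters being joined and add their union.
        clusters = [c for c in clusters if not c <= merged] + [merged]
    return sorted(sorted(c) for c in clusters)

# Hypothetical merge history for five items, numbered 0 to 4.
steps = [
    ((0,), (1,), 0.1),
    ((2,), (3,), 0.2),
    ((4,), (2, 3), 0.5),
    ((0, 1), (2, 3, 4), 0.7),
]

by_count = cut(5, steps, n_clusters=3)      # [[0, 1], [2, 3], [4]]
by_level = cut(5, steps, max_distance=0.5)  # [[0, 1], [2, 3, 4]]
print(by_count)
print(by_level)
```

Comparing the partitions produced by a few different cut points is one way to judge which grouping seems logical for the data.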

Dendrogram

The dendrogram illustrates the information in the amalgamation table in the form of a tree
diagram.

By default, the similarity level is measured along the vertical axis (alternatively, you can display
the distance level), and the different observations are listed along the horizontal axis. The
graph shows the manner in which the clusters were formed: either by joining two individual
observations, or pairing an individual observation with an existing cluster. You can see at
what similarity levels the clusters are formed, and the composition of the clusters of the final
partition.
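
The amalgamation table that a dendrogram draws can be sketched as follows. The text does not state how similarity is computed, so the formula used here, 100 * (1 - d / d_max), is an assumption (one convention, in which d_max is the largest distance in the original distance matrix). The merge list, variable names and `d_max` value are likewise made up for illustration.

```python
# Hypothetical merges as (cluster A, cluster B, distance level).
steps = [
    (("x1",), ("x2",), 0.1),
    (("x3",), ("x4",), 0.2),
    (("x5",), ("x3", "x4"), 0.5),
    (("x1", "x2"), ("x3", "x4", "x5"), 0.7),
]
d_max = 1.0  # assumed largest distance in the original distance matrix

def similarity(distance, d_max):
    """Assumed convention: 100 means identical, 0 means as far apart as possible."""
    return 100.0 * (1.0 - distance / d_max)

# Build one table row per joining step.
table = []
n_clusters = 5  # five variables, each starting in its own cluster
for step, (a, b, d) in enumerate(steps, start=1):
    n_clusters -= 1
    table.append((step, n_clusters, similarity(d, d_max), d, a + b))

print("step clusters similarity distance  new cluster")
for row in table:
    print("{:>4} {:>8} {:>10.1f} {:>8.2f}  {}".format(*row[:4], ", ".join(row[4])))
```

Each row corresponds to one join in the tree, and the similarity column gives the height at which that join would be drawn.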

For some data sets, average, centroid, median and Ward's linkage methods do not produce a
hierarchical dendrogram, meaning amalgamation distances do not always increase with each
step. In the dendrogram, such a step produces a join that goes downward rather than upward.
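
A small example shows how such a downward join can arise. It uses centroid linkage only, with three hypothetical points and Euclidean distance. It is a sketch of the effect, not a statement about every method listed above.

```python
from math import dist

# Three hypothetical points forming a nearly equilateral triangle.
points = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (1.0, 1.8)}

# Step 1: A and B are the closest pair, so they are joined first.
first_merge = dist(points["A"], points["B"])

# Step 2: centroid linkage measures from the centroid of {A, B} to C.
centroid_ab = tuple((a + b) / 2 for a, b in zip(points["A"], points["B"]))
second_merge = dist(centroid_ab, points["C"])

print("first join at distance ", first_merge)
print("second join at distance", second_merge)
print("join goes downward:", second_merge < first_merge)
```

Because the centroid of A and B lies closer to C than A and B lie to each other, the second amalgamation distance is smaller than the first.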

Linkage methods

Linkage methods determine how the distance between two clusters is defined. Choosing one
over another may not make an appreciable difference with your data. However, because the
goal of cluster amalgamation is somewhat subjective, you may find that different methods are
more or less appropriate for your particular situation and data.

| Method | Distance between two clusters is... | Reasons to use this method |
|---|---|---|
| Average | The mean distance between an item in one cluster and an item in the other cluster. | Whereas the single or complete linkage methods group clusters based upon single pair distances, average linkage uses a more central measure of location. |
| Centroid | The distance between the cluster centroids or means. | Like average linkage, this method is another averaging technique. |
| Complete | The maximum distance between an item in one cluster and an item in the other cluster. Also called the "furthest neighbor" method. | Ensures that all items in a cluster are within a maximum distance and tends to produce clusters with similar diameters. The results can be sensitive to outliers. |
| McQuitty's | The average of the distances of the soon to be joined clusters to that other cluster. For example, if clusters 1 and 3 are to be joined into a new cluster, say 1*, then the distance from 1* to cluster 4 is the average of the distances from 1 to 4 and 3 to 4. Also called "weighted average linkage." | Here, distance depends on a combination of clusters rather than individual items in the clusters. Similar to average linkage, but the sizes of the clusters are assumed equal, so the pairwise distances are weighted accordingly. |
| Median | The median distance between an item in one cluster and an item in the other cluster. | Similar to the average or centroid method, though it reduces the effect of outliers. |
| Single | The minimum distance between an item in one cluster and an item in the other cluster. Also called the "nearest neighbor" method. | Best suited for observations or variables that are clearly separated. When they lie close together, single linkage tends to identify long chain-like clusters that can have a relatively large distance separating items at either end of the chain. |
| Ward's | A function of the linkage criteria: the sum of squared deviations from points to centroids, minimizing the within-cluster sum of squares. | Tends to produce clusters with similar numbers of items, but it is sensitive to outliers. |
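
To make some of these definitions concrete, the sketch below computes the single, complete, average and centroid distances between two small clusters. The points are hypothetical and Euclidean distance is assumed. The other methods in the table are not covered.

```python
from math import dist
from statistics import mean

# Two hypothetical clusters of 2-D points.
cluster_1 = [(0.0, 0.0), (0.0, 2.0)]
cluster_2 = [(3.0, 0.0), (4.0, 0.0)]

# Every distance between an item in one cluster and an item in the other.
pairwise = [dist(p, q) for p in cluster_1 for q in cluster_2]

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(mean(coords) for coords in zip(*cluster))

single = min(pairwise)        # nearest neighbor
complete = max(pairwise)      # furthest neighbor
average = mean(pairwise)      # mean of all pairwise distances
centroid_distance = dist(centroid(cluster_1), centroid(cluster_2))

for name, value in [("single", single), ("complete", complete),
                    ("average", average), ("centroid", centroid_distance)]:
    print(f"{name:<9}{value:.3f}")
```

The single and complete distances bracket the average, which illustrates why average linkage is described as a more central measure.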

Final partition

The final partition is the final grouping of clusters, which identifies groups whose observations share useful
common characteristics. Because the definition of a useful grouping depends entirely on your
particular situation, you must specify the criteria for placing the final partition. You can
choose to define its placement based on the number of clusters you want to achieve, or by the
similarity level you require within clusters, though in practice you may well first run a
cluster.
