GRAVITATIONAL CLUSTERING
W. E. WRIGHT
Southern Illinois University at Carbondale, Carbondale, IL 62901, U.S.A.
Abstract--This paper introduces and describes an algorithm or technique, called gravitational clustering,
for performing cluster analysis on Euclidean data. The paper describes the physical gravitational model,
an abstract generalized model, and several specific gravitational models. It illustrates clustering by
one of these models using several sample data sets, and compares the results with those obtained
using two other well-known nongravitational clustering methods. The paper also illustrates four graphical
techniques to aid in the analysis of a clustering.
Cluster analysis Gravitational clustering Classification Contour plot Euclidean data
INTRODUCTION
Cluster analysis may be defined intuitively as the partitioning of a set of elements into subsets, called clusters, so as to maximize similarity within clusters and minimize similarity between clusters. Many heuristic methods have been set forth to carry out this general notion of clustering, each having certain apparent advantages and disadvantages, and each appearing to conform more or less to this general notion. This paper is concerned with another such method, inspired by the motion of particles in space due to their gravitational attraction for each other, and called gravitational clustering.
There is a phenomenon in physics in which a finite system of particles in space, each with a specified initial location, a zero velocity, a given mass, and a negligible volume, converges to the centroid of the system due to the gravitational attraction between the masses. As the particles travel toward the centroid they join through inelastic collisions with others closest to them, forming what may be considered to be new, conglomerate, particles. This process results in a system of conglomerate particles which are increasing in size, decreasing in number, and converging toward the centroid. Eventually there will be three conglomerate particles remaining, then only two, and finally just one, with mass equaling the sum of the masses of the original particles, and located at the centroid.
This natural phenomenon gives rise to a general method of clustering which shall be called "gravitational clustering". The implementation of the clustering algorithm consists of simulating the movement of the particles from their original positions to the final position, and analyzing this movement to determine the clustering results. Throughout the life of the system, each remaining particle corresponds to a cluster. Initially there will be as many clusters as there are original particles, with each cluster being comprised of a single particle. Whenever two particles join to form a new particle, the cluster corresponding to the new particle will be the union of the clusters corresponding to the joining particles. The last two clusters remaining, whose union is the entire set of original particles, define the "optimal" 2-level clustering. The last three clusters remaining define the "optimal" 3-level clustering, and so on for 4, 5, etc.
This method is clearly an agglomerative, hierarchical algorithm such as described in Ward.(9) There is, however, no objective function which is minimized or maximized at each stage. Instead, the joining of particles is determined by the continuous motion of all particles in the system according to the gravitational forces. Coleman(2) discusses a method of clustering which also uses a system of forces, although it is otherwise unlike this method. The attributes of time and mass are not a part of Coleman's model (although mass is mentioned briefly), and particles do not converge together but rather converge to "equilibrium" positions. Coleman's model is based on attributes called "psychological distances" and employs repulsive as well as attractive forces.
The element of time in the gravitational model is very important, adding a new dimension to the cluster analysis. As was mentioned previously, the sequence in which the particles come together defines the various clusterings, but the times at which such joinings occur, and the intervals between the joinings, give additional insight into the relative strength of each cluster and clustering.
The general principle is that the longer the system exists with a certain number of particles (clusters), the greater is the strength of the clustering at that level. Similarly, the greater the length of time between the creation of a cluster and its union with another cluster, the greater the strength of the former. A normalizing factor is the total time required for the system to be reduced to one cluster.
As an example, suppose that a system of 1130 particles combines very rapidly at first so that at time 15 sec (or other unit) there are 3 clusters remaining. Now suppose that these clusters are relatively far apart, and that 2 of them do not join until time 65 sec.
Finally, suppose that the 2 clusters remaining join fairly quickly at time 73 sec. This model would indicate then that the clustering with the greatest strength is at the 3 level and that it is very strong, having accounted for (65 - 15)/73 × 100% = 68% of the total life of the system.
The comparison of the strengths of clusterings can be extended to comparisons between different data sets. For example, with one data set the optimal clustering may be at the 3 level and may occupy 68% of the life of the system. With another data set, the optimal clustering may be at the 2 level and may occupy only 45% of the system life. This would indicate that the first system clusters much better than the second.
It is important to note that while true gravitational motion has served as a model for this clustering method, the method has not been limited to the use of all the various physical laws involved. The purpose of the method is to do cluster analysis, and there are no theoretical or logical restrictions against modifying the method so as to more completely implement notions of clustering.
With this generalized view of the model, the movement of each particle during each time interval is given by a function g, which has been termed a gravitational function. The specification of a gravitational function thus determines a specific gravitational clustering model. Obviously, not all gravitational functions give rise to models which yield reasonable results from the point of view of cluster analysis, and the entire concept of gravitational clustering is dependent on whether or not there exist one or more gravitational functions which yield good clustering results. There do in fact exist such functions, one of which will be illustrated in some detail later.
The "simulation" aspect of the model arises because the movement of the particles is approximated over small discrete time intervals [t, t + dt], according to the forces acting on the particles at the beginning of the intervals. In theory, the forces are a function of continuous time, and motion of the particles occurs continually rather than in discrete jumps. However, it is quite feasible to get very good approximations of continuous motion using this discrete simulation.
The procedure currently implemented determines the time increment dt at each step so as to permit the fastest moving particle to move a distance δ, which is a parameter of approximation. This parameter should be set small enough that the simulated motion is substantially the same as the theoretical motion would be. However, since a smaller value of δ will require a greater amount of computer processing time, δ should not be set unnecessarily small. Simple experimentation, with regard for the separation of the original data points, has been a satisfactory procedure for determining δ thus far. For large problems it may be necessary to compromise between speed and accuracy in setting δ.
Critical points in the process occur whenever two particles come together. When this happens the forces acting on the two particles approach infinity in magnitude, as do the velocities and accelerations of the particles. This phenomenon, however, causes neither theoretical nor practical problems in the clustering model. Although these force, velocity, and acceleration vectors approach infinity in magnitude, their sums are finite since they are almost opposite in direction (the influence of other particles keeps them generally from being exactly opposite). Assuming they combine into a single particle on impact, the motion of this particle immediately after impact is finite, since the infinite components cancel out.
Treatment of this occurrence in the practical model is very straightforward. Since the particles coming together have velocities approaching infinity, the time for them to collide, starting from a short distance ε apart, approaches zero. The rest of the particles, of course, are virtually motionless during such a short period (except when there are multiple collisions). An approximate implementation of a collision can therefore be made by first recognizing when two particles are within ε distance of each other, and then replacing them by a single particle located at their centroid. Because of the influence of other particles, the collision would not in general occur exactly at the centroid, but the difference can be made negligible by making ε small. When there are multiple collisions occurring at the same time, they can be treated as a sequence of collisions at that time. The value of ε is currently specified as a parameter of approximation, usually 2δ.
THE FORMAL MODEL AND SOME EXAMPLES
The formal model
The true gravitational model described briefly in the previous section is a type of clustering procedure, in that it starts with a set of points and winds up grouping them into various clusters "so as to maximize similarity within clusters and minimize similarity between clusters" in some sense. There is, however, no justification for assuming that the particular relations, formulas, and operating characteristics in that model would be appropriate for every or any situation in which clustering is desired. But the general operation of the model, by which original elements move continuously "toward" each other according to a specified "gravitational" relation, joining other elements along the way to form "clusters" which combine ultimately into one, is a very useful model which shall be called a "gravitational clustering" procedure. The formal gravitational clustering procedure is as follows:
(a) A finite number n of elements or particles P_1,..., P_n are originally specified by their locations or measurements s_1,..., s_n, and by their masses m_1,..., m_n. It is assumed that all measurements have been appropriately scaled so as to yield this set of n points
in Euclidean m-space. No scaling problems are considered here.
(b) Two parameters of approximation, δ and ε, must be specified. In each time interval the length dt of the interval will be determined so that the fastest moving particle will move the distance δ. When two particles come within ε distance of each other, they are joined into a single particle at their centroid. A good value for ε is 2δ.
(c) The time t is initialized to 0. Steps d, e, and f are repeatedly executed until only one distinct particle remains.
(d) The movement of each remaining particle i during the interval [t, t + dt] is computed according to the function g(i, t, dt), which is a function of the attributes of particle i and all other remaining particles. The time increment dt is determined so that the longest distance moved will be δ (there are other possible ways for determining dt or specifying δ).
(e) The new position s_i(t + dt) of each remaining particle i is computed as the sum s_i(t + dt) ← s_i(t) + g(i, t, dt). The time t is incremented by dt, i.e. t ← t + dt.
(f) If any pair of particles i and j have moved to within ε distance of each other, then particle j is joined into particle i (assume i < j). Specifically, s_i(t) ← (m_i(t)·s_i(t) + m_j(t)·s_j(t))/(m_i(t) + m_j(t)), m_i(t) ← m_i(t) + m_j(t), and particle j is deleted.
The physical model reexamined
Now that the definition of a gravitational clustering model has been given, it will be shown that the physical model discussed earlier is in fact a gravitational model. To do this it is necessary only to give the function g = g(i, t, dt), which gives the movement of particle i at time t. The following development does this (assume N(r) is the set of particles remaining at time r):
$$g(i, t, dt) = v_i(t)\,dt + \tfrac{1}{2}\,a_i(t)\,dt^2 \tag{1}$$
$$v_i(t) = \int_0^t a_i(r)\,dr$$
$$a_i(r) = \sum_{j \in N(r),\, j \neq i} G\,\frac{m_i(r)\,m_j(r)}{m_i(r)}\,\frac{1}{|s_j(r) - s_i(r)|^2} \times \frac{s_j(r) - s_i(r)}{|s_j(r) - s_i(r)|}$$
$$g(i, t, dt) = dt \int_0^t a_i(r)\,dr + \tfrac{1}{2}\,G \sum_{j \in N(t),\, j \neq i} \frac{m_i(t)\,m_j(t)}{m_i(t)}\,\frac{1}{|s_j(t) - s_i(t)|^2} \times \frac{s_j(t) - s_i(t)}{|s_j(t) - s_i(t)|}\,dt^2$$
The final equation above gives the gravitational function g(i, t, dt), therefore the physical model is in fact a gravitational clustering model.
A modification to the physical model
As mentioned earlier, there has been no justification for assuming that the particular operating characteristics of the physical model are appropriate for any or every clustering problem. Thus, it is of interest to consider modifications to the model which might improve its properties or performance. One such modification imparts to the model a generally highly desirable property, which shall be called Markovian.
A gravitational model will be said to be a Markovian model if the function g_{i,dt}(t) = g(i, t, dt), considered as a function of t, depends only on the present attributes (locations and masses) of the remaining particles and not on any past history. By way of clarification, let P denote a set of particles (the location and mass of each particle). For t ≥ 0 let P(t) denote the set at time t in a gravitational clustering of P. Clearly, P(0) = P. Now suppose u ≥ 0 and P(u) = Q for some set of particles Q. Then for a Markovian gravitational model, P(t) = Q(t - u) for every t ≥ u.
The physical model is not Markovian because the gravitational function at a particular time t depends not only on the locations and masses of the particles at that time but also on the velocities which they have built up. A Markovian model can be obtained from the physical model by merely deleting the velocity term in the motion formula (1). The net effect on the formal model is that the gravitational function g(i, t, dt) becomes
$$g(i, t, dt) = \tfrac{1}{2}\,G \sum_{j \in N(t),\, j \neq i} \frac{m_i(t)\,m_j(t)}{m_i(t)}\,\frac{s_j(t) - s_i(t)}{|s_j(t) - s_i(t)|^3}\,dt^2 \tag{2}$$
This new model is analogous to motion in a viscous fluid, in which particles cannot build up any momentum. Each particle moves directly toward its net attraction, without any orbiting or oscillation. As mentioned, the true physical model does not possess these qualities. While its particles may ultimately meet at the centroid, they might have oscillating and orbiting tendencies along the way. It is obvious that with the velocity component deleted, as in the modified model, each particle will necessarily move toward its net attraction since it has no influence otherwise.
$$g(i, t, dt) = dt^2 \sum_{j \in N(t),\, j \neq i} \frac{m_i^p(t)\,m_j^q(t)}{m_i(t)}\,\frac{1}{|s_j(t) - s_i(t)|^2} \times \frac{s_j(t) - s_i(t)}{|s_j(t) - s_i(t)|} \tag{3}$$
where p and q are real numbers. Equation (2) is obtained by setting p = 1 and q = 1 in (3) (note that the irrelevant factors 1/2 and G have been dropped). A model which has given excellent empirical results is defined by (3) with p = 0 and q = 0:
$$g(i, t, dt) = dt^2 \sum_{j \in N(t),\, j \neq i} \frac{1}{m_i(t)}\,\frac{s_j(t) - s_i(t)}{|s_j(t) - s_i(t)|^3} \tag{4}$$
[...] and usefulness in physical models. Some clustering problems may dictate the preservation of the centroid by their nature and properties, but other problems might not necessarily imply such a restriction.
Two more examples will be mentioned briefly. In the first one, the gravitational function is given by [...]
[...] then this model is "centroid-preserving". Thus, like true centroid-preserving models, the location of the final composite particle can be determined without executing the model.
INTERPRETATION OF RESULTS
Now that the technique of gravitational clustering has been described, its specific application to cluster analysis will be considered in greater detail. What is the optimal clustering? How many clusters does it contain? How strong is it? What are the subclusters and how strong are they? What is the complete hierarchy of subclusters? Gravitational clustering provides answers to these questions through the analysis of the movement of the elements as they "gravitate" toward each other. Several techniques have been designed for observing the process, and these will now be described briefly. For a more complete discussion see Wright.(10)
The most basic form for displaying the gravitational process is to list the occurrence of events (cf. Fig. 1). These events are of two types, the movement of a particle and the joining of two or more particles.
The contour plot (cf. Fig. 2b) is probably the most graphical of the displays presented here. Its purpose is to illustrate the clustering structure as it exists at various times in the simulated process. The points are initially plotted, and a segment of time is chosen so as to divide the total life of the system into a specified number of equal subintervals. At the end of the first subinterval the clusters which have formed up to that time are determined, and a convex polyhedron is drawn around each such cluster. This step is then repeated for the second subinterval, and so on up to the last subinterval. In this manner much of the clustering hierarchy will be vividly portrayed, and the strengths of the clusters will also be indicated, since a relatively strong cluster might exist over several subintervals of time. An illustration of this plot has been given by Wright.(12) Napior(4) presents a contour-type display in which subsets are manually encircled to indicate clustering structure. These circles, however, do not directly correspond to any measured value such as time or distance, and only a single set of circles, rather than a hierarchy, is used. Rohlf(5) uses manually drawn "contour circles" to indicate a hierarchy of clusters.
Clearly, the contour plot is restricted to 2-dimensional data, although some insight into 3- or 4-dimensional data may be gained by the use of projections. While such a restriction is very significant for practical clustering problems, it does not greatly diminish the value of this display for studying or testing clustering procedures, since artificial 2-dimensional data can usually be used for that purpose with no loss in generality.
The tree plot (cf. Fig. 2c) is analogous to the dendrograms described by Sokal and Sneath(7) and more recently by Sneath and Sokal,(6) which are a common display tool in cluster analysis. It is unrestricted as to dimensionality and yields a continuous interpretation of the state of the model. The plot consists of a binary tree in which each node corresponds to a cluster in the clustering hierarchy, and each branch indicates a hierarchical relationship. The root naturally corresponds to the entire data set, and the terminal nodes correspond to individual data elements. The cluster corresponding to a node consists of the terminal nodes in the subtree having that node as its root, and the vertical component of the node indicates the time of formation of the cluster.
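The event list and the tree plot are both built from the joinings recorded while the model runs. The following is a minimal sketch in Python of the formal steps (a)-(f) with the Markovian function (4), not a reproduction of the original FORTRAN program. The names `gravitational_clustering`, `join_close_pairs`, and `max_steps` are hypothetical, unit masses are assumed by default, ε is taken as 2δ, and the joining check is run at the top of each pass so that coincident starting points are merged before any force is computed.

```python
import math

def gravitational_clustering(points, delta=0.02, masses=None, max_steps=100_000):
    """Return (events, final_position); each event is (time, merged member indices)."""
    eps = 2.0 * delta                      # joining distance, taken as 2 * delta
    pos = [[float(x) for x in p] for p in points]
    mass = [1.0] * len(pos) if masses is None else [float(m) for m in masses]
    members = [[i] for i in range(len(pos))]
    t, events = 0.0, []

    def join_close_pairs():
        # Step (f): join particle j into particle i (i < j) at their centroid.
        joined = True
        while joined:
            joined = False
            for i in range(len(pos)):
                for j in range(i + 1, len(pos)):
                    if math.dist(pos[i], pos[j]) <= eps:
                        total = mass[i] + mass[j]
                        pos[i] = [(mass[i] * a + mass[j] * b) / total
                                  for a, b in zip(pos[i], pos[j])]
                        mass[i] = total
                        members[i] = sorted(members[i] + members[j])
                        del pos[j], mass[j], members[j]
                        events.append((t, list(members[i])))
                        joined = True
                        break
                if joined:
                    break

    for _ in range(max_steps):             # safety guard, not part of the paper
        join_close_pairs()
        if len(pos) == 1:
            return events, pos[0]
        # Step (d): pull on each particle per function (4), without the dt^2 factor.
        pulls = []
        for i in range(len(pos)):
            pull = [0.0] * len(pos[i])
            for j in range(len(pos)):
                if j != i:
                    d = math.dist(pos[i], pos[j])
                    for k in range(len(pull)):
                        pull[k] += (pos[j][k] - pos[i][k]) / (mass[i] * d ** 3)
            pulls.append(pull)
        # Choose dt so that the fastest particle moves exactly delta.
        dt = math.sqrt(delta / max(math.hypot(*p) for p in pulls))
        # Step (e): move every particle and advance the clock.
        for i, pull in enumerate(pulls):
            pos[i] = [s + dt * dt * u for s, u in zip(pos[i], pull)]
        t += dt
    raise RuntimeError("simulation did not finish within max_steps")
```

Each entry of `events` gives the time of a binary joining and the members of the cluster it created, which is the information a tree plot needs for the height and content of each node. The sketch recomputes all pairwise distances at every step, so it is only suitable for small data sets.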
Fig. 2. (a) Data points; (b) contour plot; (c) tree plot; (d) time plot.
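The contour plot draws a convex outline around every cluster that exists at the end of each subinterval. A minimal sketch of that bookkeeping in Python follows. It assumes the joinings are available as a time-ordered list of `(time, member_indices)` pairs, and it uses Andrew's monotone chain algorithm for the 2-dimensional hull, which is a choice made here for illustration; the paper does not say how the outlines were computed. The names `convex_hull`, `clusters_at`, and `contour_outlines` are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def clusters_at(events, n_points, tau):
    """Clusters (as sets of point indices) that exist at time tau."""
    clusters = [{i} for i in range(n_points)]
    for time, merged in events:            # events must be sorted by time
        if time > tau:
            break
        merged = set(merged)
        clusters = [c for c in clusters if not c <= merged] + [merged]
    return clusters

def contour_outlines(points, events, n_intervals):
    """For each subinterval end, the hulls of all clusters with 2 or more points."""
    life = events[-1][0]                   # total life of the system
    outlines = []
    for k in range(1, n_intervals + 1):
        tau = life if k == n_intervals else k * life / n_intervals
        hulls = [convex_hull([points[i] for i in sorted(c)])
                 for c in clusters_at(events, len(points), tau) if len(c) > 1]
        outlines.append((tau, hulls))
    return outlines
```

A cluster that persists over several subintervals produces the same outline repeatedly, which is how the plot conveys cluster strength.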
A time plot (cf. Fig. 2d) is a display which is used to give a crude but easily understandable picture of the relative strengths of the clustering levels. In this plot, a horizontal line segment is drawn corresponding to each binary joining. All the segments are equal in length but their height varies according to the amount of time between successive joinings. The first (left-most) line segment drawn indicates the time until the first binary joining, the second segment indicates the time between the first and second joinings, and so on until the (n - 1)th segment indicates the time between the (n - 2)th and the (n - 1)th binary joining.
Two other displays might be used to indicate the motion of the elements in the simulation process. One such display consists of plotting the positions of the remaining elements at regular intervals of time, and is called a position plot (cf. Fig. 3). Extension of this to the continuous case would yield a motion plot which would consist of a continual display of the moving particles on a display tube or in a movie picture. This display might be especially helpful in observing the effects of scaling adjustments on the data, comparing the motions obtained using different gravitational functions, and picking out "borderline" elements which are close to moving into more than one cluster. Note of course that these displays are also restricted to 2-dimensional data.
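The segment heights of the time plot, and the relative strengths they express, follow directly from the list of joining times. A minimal sketch in Python is given below; the function names `time_plot_heights` and `level_strengths` are hypothetical. The sample joining times reuse the introductory example (3 clusters remaining at time 15, joinings at 65 and 73) for an assumed 5-point system, with an invented first joining at time 5.

```python
def time_plot_heights(join_times):
    """Height of each segment: the time elapsed between successive joinings."""
    times = sorted(join_times)
    return [b - a for a, b in zip([0.0] + times, times)]

def level_strengths(join_times, n_points):
    """Map clustering level k to the fraction of system life spent with k clusters."""
    times = sorted(join_times)
    total = times[-1]                      # normalizing factor: total life of the system
    strengths = {}
    for idx in range(len(times) - 1):
        remaining = n_points - (idx + 1)   # clusters left after joining number idx + 1
        strengths[remaining] = (times[idx + 1] - times[idx]) / total
    return strengths

# Hypothetical 5-point system: joinings at times 5, 15, 65, and 73.
joins = [5, 15, 65, 73]
heights = time_plot_heights(joins)
strengths = level_strengths(joins, n_points=5)
print(heights)                             # one height per binary joining
print(round(100 * strengths[3]))           # strength of the 3-level clustering, in percent
```

With these times the 3-level clustering accounts for (65 - 15)/73 of the system life, matching the 68% figure of the introductory example.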
Fig. 3. Position plot.
[...] distance of each other. The clusters consist of the maximal sets of connected points, in which each point is connected to every other point by a chain of single links. The threshold is started at zero and increased until finally all the points are linked into one cluster. So again we have a sequence of events in which points or groups of points combine at certain values of a parameter (threshold) and form larger groups which are fewer in number.
All clustering programs are written in FORTRAN IV and have been run on an IBM 370/158 computer. The amount of CPU time required to execute the programs naturally depends on the size of the data set. For the gravitational model, it also depends on the separation of the data and the maximum step size δ. Specifically, it varies directly with the number of dimensions, and inversely with δ. For most of the data sets used here, involving 100 points in 2 dimensions, the run times for the centroid replacement method were approximately 11 sec, and for the single-link method 9 sec. For the gravitational method with δ = 0.03 the run times were approx. 44 sec. For the 60 point data set with δ = 0.03, the gravitational run time was approx. 8 sec. The smaller time was due not so much to the smaller number of points but to the smaller separation between the points. Using different values for δ, gravitational clustering has been done in a few minutes on data sets of 114 points in 44 dimensions and 540 points in 4 dimensions.
Arbitrarily generated points
The first sample to be considered consists of a set of 60 points manually marked in 2-dimensional space. The points are plotted in Fig. 2(a). The gravitational clustering produced the contour plot of Fig. 2(b) referred to earlier. The total simulated time was 62.6 sec, and the contour lines correspond to 10-sec intervals. It is clear from the contour plot that the points cluster best into two clusters, one containing 29 points and the other containing 31 points. Approximately 64% of the life of the system was spent with 2 particles remaining. The lower cluster also has two reasonably good subclusters, one containing 21 points and the other containing 8 points. These subclusters were in existence for ca. 27 and 35%, respectively, of the life of the system.
The time plot is given in Fig. 2(d). It also indicates that the strongest clustering level is 2, the next strongest 3 (21, 8, and 31 points), and the next strongest 4 (21, 8, 20, and 11 points). After that, little significant strength is shown.
The tree plot is given in Fig. 2(c), and it provides a very complete picture of the simulation. It indicates that the system existed with 2 clusters for ca. 40 sec, with 3 clusters for ca. 11 sec, with 4 clusters for 4.5 sec etc. (This may be checked with the time plot.) The lower cluster, containing 29 points, existed for ca. 17 sec. Its 8-point subcluster was unusually strong, forming at time 0.7 sec and not joining with the other subcluster until time 22.5 sec, for a total life of ca. 22 sec. The 31-point cluster was not so well separated, dividing into subclusters at a rather uniform rate.
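The percentages quoted for this sample follow from dividing each duration by the total simulated time of 62.6 sec. As a check on the arithmetic, using only the rounded durations given in the text (so the results are themselves approximate):

```latex
\[
\frac{40}{62.6} \approx 0.64, \qquad
\frac{11}{62.6} \approx 0.18, \qquad
\frac{4.5}{62.6} \approx 0.07
\]
\[
\frac{17}{62.6} \approx 0.27, \qquad
\frac{22.5 - 0.7}{62.6} = \frac{21.8}{62.6} \approx 0.35
\]
```

The first line gives the strengths of the 2-, 3-, and 4-level clusterings; the second line reproduces the 27% and 35% figures from the durations of ca. 17 and ca. 22 sec.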
Two normal distributions
The second data set consists of 50 points each from two normal distributions with different means and different dispersions (Fig. 4a). Contour plots using the gravitational method, centroid-replacement method, and single-link method are given in Figs. 4(b), (c), and (d), respectively. Because of the closeness of the clusters, some of the contour lines overlap and make it difficult to determine the exact allocation of some of the points. Fewer contour lines might make an interpretation easier, or a tree plot could be studied to determine the exact allocation.
The gravitational method seems to pick off the separation very nicely. In the centroid replacement method, the lower cluster appears to reach too high for some of its points. The single-link method gives little indication of the clustering structure.
Different-sized populations
The third data set consists of 30 points from one normal distribution with a small dispersion and 70 points from another normal distribution with a larger dispersion (Fig. 5a). The gravitational, centroid replacement, and single link clusterings are indicated in Figs. 5(b), (c), and (d), respectively. Again, the gravitational method seems to this observer to pick up the clusters very satisfactorily, with the centroid replacement method doing some surprising allocation, and the single link method being of little help.
Different-sized elliptical distributions
The fourth data set consists of one elliptical distribution containing 90 points with a large dispersion, and another elliptical distribution containing 10 points with a small dispersion (Fig. 6a). The clusterings are given as usual in Fig. 6(b), (c) and (d). Note that both the gravitational method and the centroid replacement method pick up the small cluster, although it is also grouped at the 2-level clustering with half of the large distribution, because it is not significant enough to comprise an independent cluster at the 2 level.
Iris species
The fifth example is quite well known and consists of measurements taken on 150 Iris flowers (cf. Fisher(3) and Anderson(1)). According to botanical classification, the first 50 samples were from the species Iris setosa, the next 50 from the species Iris versicolor, and the next 50 from the species Iris virginica. Previous results have indicated that the Iris versicolor and Iris virginica are closely related to each other, but are rather distinct from the Iris setosa. Four measurements were taken on each flower, consisting of the sepal length, the sepal width, the petal length, and the petal width.
Fig. 7. Three species of Iris.
The tree plot for this sample is given in Fig. 7. The element numbers at the bottom of the plot are unreadable on so small a scale, but the essence of the results can still be understood with a little explanation. The results are consistent with the earlier indications that the Iris setosa is quite distinct from the other two species, since the first 50 elements come together very rapidly and then remain as an individual unit until the end. The 2-level clustering has a very high life time of (338 - 50)/338 = 85%.
Because of the relative dissimilarity between the Iris setosa and the other two species, the model was run again only for the Iris versicolor and the Iris virginica, now numbered 1-100. The tree plot for this sample is given in Fig. 8. The life times for the 2-, 3-, 4-, and 5-level clusterings are 40, 19, 7, 12, and 25% respectively. The unusual strength at the 5 level is particularly noteworthy, and might indicate an alternative species classification. The intermixing between the two species can be seen from Fig. 8.
Occupational groupings
The sixth and final example consists of 44 traits specified for 114 occupational groupings, as obtained from (8). The occupations are listed in Table 1 and the traits are given in Table 2. The tree plot is given in Fig. 9 and is self-explanatory as to the clusterings obtained.
Fig. 8. Two species of Iris.
Number Meaning
Education
1 General educational development
2 Specific vocational preparation
Aptitude
3 Intelligence
4 Verbal
5 Numerical
6 Spatial
7 Form perception
8 Clerical perception
9 Motor coordination
10 Finger dexterity
11 Manual dexterity
12 Eye-hand-foot coordination
13 Color perception
Temperaments
14 Variety and change
15 Repetition and short cycle
16 Under specific instruction
17 Direction, control, and planning
18 Dealing with people
19 Influencing people
20 Performing under stress
21 Sensory or judgmental criteria
22 Measurable or verifiable criteria
23 Interpretation of feelings, ideas, or facts
24 Set limits, tolerances, or standards
Interests
25 Things and objects
26 Business contact with people
27 Routine concrete, organized activities
28 Social welfare or dealing with people and language in social situations
29 Prestige or esteem of others
30 People and the communication of ideas
31 Scientific or technical activities
32 Abstract or creative activities
33 Nonsocial activities-processes, machines, techniques
34 Tangible, productive satisfactions
Physical capabilities
35 Sedentary work
36 Light work
37 Medium work
38 Heavy work
39 Very heavy work
40 Climbing, balancing
41 Stooping, kneeling, crouching, crawling
42 Reaching, handling, fingering, feeling
43 Talking, hearing
44 Seeing
Many further illustrations could have been given, some extending the path of earlier models, others completely divergent from earlier models. For example, there was no specific mention of a model with a distance component being inversely proportional to the absolute Euclidean distance, instead of to the Euclidean distance squared. Nor was there any discussion about a model having an attractive force between two masses being proportional to the sum of the two masses, i.e. to the total mass involved.
In general, a person who does gravitational clustering must necessarily choose the specific model he wishes to use. Sometimes the particular characteristics of his information will dictate some of the features of the model. Usually, however, he will merely select features that intuitively seem to be desirable in general, such as the Markovian property, the unit attraction property, and the centroid-preservation property.
A significant amount of theoretical evaluation of the gravitational clustering method has been carried out, although it will only be alluded to here. Wright(11,13) has proposed a formal model for cluster analysis that consists to a large extent of specifying several properties that all legitimate clustering methods should satisfy. Because of the heuristic implementation of the gravitational model, it is impossible to determine theoretically whether or not one of the properties, regarding continuity, is satisfied. In addition, a straightforward extension of the model, regarding its "domain of definition," must be made in order to satisfy another property. With these exceptions noted, however, it has been shown theoretically that the unit attraction Markovian model satisfies all other properties.(10)
REFERENCES
1. E. Anderson, The species problem in Iris, Ann. Missouri Bot. Gardens 23, 457-509 (1936).
2. J. S. Coleman, Clustering in n dimensions by use of a system of forces, J. Math. Sociol. 1, 1-47 (1970).
3. R. A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugen. 7, Part II (1936).
4. D. Napior, Nonmetric multidimensional techniques for summated ratings, in Multidimensional Scaling, pp. 157-158. Seminar Press, NY (1971).
5. F. J. Rohlf, Adaptive hierarchical clustering schemes, Syst. Zool. 19, 58-82 (1970).
6. P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy, Freeman & Co., San Francisco (1973).
7. R. R. Sokal and P. H. A. Sneath, Principles of Numerical Taxonomy, Freeman & Co., San Francisco (1963).
8. United States Department of Labor, Bureau of Employment Security, Dictionary of Occupational Titles, Vol. 2, Third Edition, pp. 226-528 (1965).
9. J. H. Ward, Hierarchical groupings to optimize an objective function, J. Am. Statist. Assoc. 58, 236-244 (1963).
10. W. E. Wright, A Formalization of Cluster Analysis, and Gravitational Clustering, Doctoral Dissertation, Washington University, St. Louis (1972).
11. W. E. Wright, A formalization of cluster analysis, Pattern Recognition 5, 273-282 (1973).
12. W. E. Wright, Contour plot, Comm. ACM 17, (1974).
13. W. E. Wright, An axiomatic specification of Euclidean cluster analysis, Comput. J. 17, 355-364 (1974).
About the Author--WILLIAM E. WRIGHT received the B.A. degree in mathematics from Southern Illinois
University in 1966, the M.A. degree in mathematics from the University of Illinois in 1968, and the
D.Sc degree in Applied Mathematics and Computer Science from Washington University in 1972.
He is presently an Assistant Professor in the Department of Computer Science at Southern Illinois
University-Carbondale. He has several publications in the area of cluster analysis, and is presently
doing research in the areas of minicomputers, information management, and simulation.
He is a member of the Association for Computing Machinery and ACM Special Interest Groups
on Minicomputers and Computer Science Education.