
Data Clustering: A Review

A.K. JAIN
Michigan State University

M.N. MURTY
Indian Institute of Science

AND
P.J. FLYNN
The Ohio State University

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been
addressed in many contexts and by researchers in many disciplines; this reflects its
broad appeal and usefulness as one of the steps in exploratory data analysis.
However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts in different communities have made the transfer of useful
generic concepts and methodologies slow to occur. This paper presents an overview
of pattern clustering methods from a statistical pattern recognition perspective,
with a goal of providing useful advice and references to fundamental concepts
accessible to the broad community of clustering practitioners. We present a
taxonomy of clustering techniques, and identify cross-cutting themes and recent
advances. We also describe some important applications of clustering algorithms
such as image segmentation, object recognition, and information retrieval.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models; I.5.3 [Pattern Recognition]: Clustering; I.5.4 [Pattern Recognition]: Applications—
Computer vision; H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval—Clustering; I.2.6 [Artificial Intelligence]:
Learning—Knowledge acquisition
General Terms: Algorithms
Additional Key Words and Phrases: Cluster analysis, clustering applications,
exploratory data analysis, incremental clustering, similarity indices, unsupervised
learning

Section 6.1 is based on the chapter “Image Segmentation Using Clustering” by A.K. Jain and P.J.
Flynn, Advances in Image Understanding: A Festschrift for Azriel Rosenfeld (K. Bowyer and N. Ahuja,
Eds.), 1996 IEEE Computer Society Press, and is used by permission of the IEEE Computer Society.
Authors’ addresses: A. Jain, Department of Computer Science, Michigan State University, A714 Wells
Hall, East Lansing, MI 48824; M. Murty, Department of Computer Science and Automation, Indian
Institute of Science, Bangalore, 560 012, India; P. Flynn, Department of Electrical Engineering, The
Ohio State University, Columbus, OH 43210.
Permission to make digital / hard copy of part or all of this work for personal or classroom use is granted
without fee provided that the copies are not made or distributed for profit or commercial advantage, the
copyright notice, the title of the publication, and its date appear, and notice is given that copying is by
permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to
lists, requires prior specific permission and / or a fee.
© 2000 ACM 0360-0300/99/0900–0001 $5.00


CONTENTS

1. Introduction
   1.1 Motivation
   1.2 Components of a Clustering Task
   1.3 The User's Dilemma and the Role of Expertise
   1.4 History
   1.5 Outline
2. Definitions and Notation
3. Pattern Representation, Feature Selection and Extraction
4. Similarity Measures
5. Clustering Techniques
   5.1 Hierarchical Clustering Algorithms
   5.2 Partitional Algorithms
   5.3 Mixture-Resolving and Mode-Seeking Algorithms
   5.4 Nearest Neighbor Clustering
   5.5 Fuzzy Clustering
   5.6 Representation of Clusters
   5.7 Artificial Neural Networks for Clustering
   5.8 Evolutionary Approaches for Clustering
   5.9 Search-Based Approaches
   5.10 A Comparison of Techniques
   5.11 Incorporating Domain Constraints in Clustering
   5.12 Clustering Large Data Sets
6. Applications
   6.1 Image Segmentation Using Clustering
   6.2 Object and Character Recognition
   6.3 Information Retrieval
   6.4 Data Mining
7. Summary

1. INTRODUCTION

1.1 Motivation

Data analysis underlies many computing applications, either in a design phase or as part of their on-line operations. Data analysis procedures can be dichotomized as either exploratory or confirmatory, based on the availability of appropriate models for the data source, but a key element in both types of procedures (whether for hypothesis formation or decision-making) is the grouping, or classification of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering) revealed through analysis. Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a valid cluster are more similar to each other than they are to a pattern belonging to a different cluster. An example of clustering is depicted in Figure 1. The input patterns are shown in Figure 1(a), and the desired clusters are shown in Figure 1(b). Here, points belonging to the same cluster are given the same label. The variety of techniques for representing data, measuring proximity (similarity) between data elements, and grouping data elements has produced a rich and often confusing assortment of clustering methods.

It is important to understand the difference between clustering (unsupervised classification) and discriminant analysis (supervised classification). In supervised classification, we are provided with a collection of labeled (preclassified) patterns; the problem is to label a newly encountered, yet unlabeled, pattern. Typically, the given labeled (training) patterns are used to learn the descriptions of classes which in turn are used to label a new pattern. In the case of clustering, the problem is to group a given collection of unlabeled patterns into meaningful clusters. In a sense, labels are associated with clusters also, but these category labels are data driven; that is, they are obtained solely from the data.

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine-learning situations, including data mining, document retrieval, image segmentation, and pattern classification. However, in many such problems, there is little prior information (e.g., statistical models) available about the data, and the decision-maker must make as few assumptions about the data as possible. It is under these restrictions that clustering methodology is particularly appropriate for the exploration of interrelationships among the data points to make an assessment (perhaps preliminary) of their structure.


Figure 1. Data clustering: (a) the input patterns; (b) the desired clusters, with points belonging to the same cluster given the same label.

The term "clustering" is used in several research communities to describe methods for grouping of unlabeled data. These communities have different terminologies and assumptions for the components of the clustering process and the contexts in which clustering is used. Thus, we face a dilemma regarding the scope of this survey. The production of a truly comprehensive survey would be a monumental task given the sheer mass of literature in this area. The accessibility of the survey might also be questionable given the need to reconcile very different vocabularies and assumptions regarding clustering in the various communities.

The goal of this paper is to survey the core concepts and techniques in the large subset of cluster analysis with its roots in statistics and decision theory. Where appropriate, references will be made to key concepts and techniques arising from clustering methodology in the machine-learning and other communities.

The audience for this paper includes practitioners in the pattern recognition and image analysis communities (who should view it as a summarization of current practice), practitioners in the machine-learning communities (who should view it as a snapshot of a closely related field with a rich history of well-understood techniques), and the broader audience of scientific professionals (who should view it as an accessible introduction to a mature field that is making important contributions to computing application areas).

1.2 Components of a Clustering Task

Typical pattern clustering activity involves the following steps [Jain and Dubes 1988]:

(1) pattern representation (optionally including feature extraction and/or selection),

(2) definition of a pattern proximity measure appropriate to the data domain,

(3) clustering or grouping,

(4) data abstraction (if needed), and

(5) assessment of output (if needed).

Figure 2 depicts a typical sequencing of the first three of these steps, including a feedback path where the grouping process output could affect subsequent feature extraction and similarity computations.


Figure 2. Stages in clustering: patterns pass through feature selection/extraction to yield pattern representations, interpattern similarities are computed, and grouping produces clusters, with a feedback loop from the grouping output back to the earlier stages.

Pattern representation refers to the number of classes, the number of available patterns, and the number, type, and scale of the features available to the clustering algorithm. Some of this information may not be controllable by the practitioner. Feature selection is the process of identifying the most effective subset of the original features to use in clustering. Feature extraction is the use of one or more transformations of the input features to produce new salient features. Either or both of these techniques can be used to obtain an appropriate set of features to use in clustering.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities [Anderberg 1973; Jain and Dubes 1988; Diday and Simon 1976]. A simple distance measure like Euclidean distance can often be used to reflect dissimilarity between two patterns, whereas other similarity measures can be used to characterize the conceptual similarity between patterns [Michalski and Stepp 1983]. Distance measures are discussed in Section 4.

The grouping step can be performed in a number of ways. The output clustering (or clusterings) can be hard (a partition of the data into groups) or fuzzy (where each pattern has a variable degree of membership in each of the output clusters). Hierarchical clustering algorithms produce a nested series of partitions based on a criterion for merging or splitting clusters based on similarity. Partitional clustering algorithms identify the partition that optimizes (usually locally) a clustering criterion. Additional techniques for the grouping operation include probabilistic [Brailovski 1991] and graph-theoretic [Zahn 1971] clustering methods. The variety of techniques for cluster formation is described in Section 5.

Data abstraction is the process of extracting a simple and compact representation of a data set. Here, simplicity is either from the perspective of automatic analysis (so that a machine can perform further processing efficiently) or it is human-oriented (so that the representation obtained is easy to comprehend and intuitively appealing). In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid [Diday and Simon 1976].

How is the output of a clustering algorithm evaluated? What characterizes a 'good' clustering result and a 'poor' one? All clustering algorithms will, when presented with data, produce clusters — regardless of whether the data contain clusters or not. If the data does contain clusters, some clustering algorithms may obtain 'better' clusters than others. The assessment of a clustering procedure's output, then, has several facets. One is actually an assessment of the data domain rather than the clustering algorithm itself — data which do not contain clusters should not be processed by a clustering algorithm. The study of cluster tendency, wherein the input data are examined to see if there is any merit to a cluster analysis prior to one being performed, is a relatively inactive research area, and will not be considered further in this survey. The interested reader is referred to Dubes [1987] and Cheng [1995] for information.
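As a toy illustration of how steps (1)-(3) fit together, the sketch below (a made-up example, not a procedure advocated by this survey) standardizes the features as a simple representation choice, uses Euclidean distance as the proximity measure, and performs the grouping with a deliberately naive single-pass rule; the threshold value and the data matrix are illustrative assumptions.

import numpy as np

def cluster_pipeline(X, threshold):
    # Step (1): pattern representation -- standardize each feature.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step (2): pattern proximity -- Euclidean distance to cluster representatives.
    # Step (3): grouping -- a naive single-pass rule: join the nearest existing
    # cluster if it is close enough, otherwise start a new cluster.
    reps, labels = [], []
    for z in Z:
        if reps:
            d = [float(np.linalg.norm(z - r)) for r in reps]
            j = int(np.argmin(d))
            if d[j] < threshold:
                labels.append(j)
                continue
        reps.append(z)                 # this pattern represents a new cluster
        labels.append(len(reps) - 1)
    return labels

X = np.array([[1.0, 1.0], [1.1, 0.9], [4.0, 4.2], [4.1, 4.0]])
print(cluster_pipeline(X, threshold=1.0))   # e.g., [0, 0, 1, 1]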


Cluster validity analysis, by contrast, is the assessment of a clustering procedure's output. Often this analysis uses a specific criterion of optimality; however, these criteria are usually arrived at subjectively. Hence, little in the way of 'gold standards' exists in clustering except in well-prescribed subdomains. Validity assessments are objective [Dubes 1993] and are performed to determine whether the output is meaningful. A clustering structure is valid if it cannot reasonably have occurred by chance or as an artifact of a clustering algorithm. When statistical approaches to clustering are used, validation is accomplished by carefully applying statistical methods and testing hypotheses. There are three types of validation studies. An external assessment of validity compares the recovered structure to an a priori structure. An internal examination of validity tries to determine if the structure is intrinsically appropriate for the data. A relative test compares two structures and measures their relative merit. Indices used for this comparison are discussed in detail in Jain and Dubes [1988] and Dubes [1993], and are not discussed further in this paper.

1.3 The User's Dilemma and the Role of Expertise

The availability of such a vast collection of clustering algorithms in the literature can easily confound a user attempting to select an algorithm suitable for the problem at hand. In Dubes and Jain [1976], a set of admissibility criteria defined by Fisher and Van Ness [1971] are used to compare clustering algorithms. These admissibility criteria are based on: (1) the manner in which clusters are formed, (2) the structure of the data, and (3) sensitivity of the clustering technique to changes that do not affect the structure of the data. However, there is no critical analysis of clustering algorithms dealing with important questions such as:

—How should the data be normalized?

—Which similarity measure is appropriate to use in a given situation?

—How should domain knowledge be utilized in a particular clustering problem?

—How can a very large data set (say, a million patterns) be clustered efficiently?

These issues have motivated this survey, and its aim is to provide a perspective on the state of the art in clustering methodology and algorithms. With such a perspective, an informed practitioner should be able to confidently assess the tradeoffs of different techniques, and ultimately make a competent decision on a technique or suite of techniques to employ in a particular application.

There is no clustering technique that is universally applicable in uncovering the variety of structures present in multidimensional data sets. For example, consider the two-dimensional data set shown in Figure 1(a). Not all clustering techniques can uncover all the clusters present here with equal facility, because clustering algorithms often contain implicit assumptions about cluster shape or multiple-cluster configurations based on the similarity measures and grouping criteria used.

Humans perform competitively with automatic clustering procedures in two dimensions, but most real problems involve clustering in higher dimensions. It is difficult for humans to obtain an intuitive interpretation of data embedded in a high-dimensional space. In addition, data hardly follow the "ideal" structures (e.g., hyperspherical, linear) shown in Figure 1. This explains the large number of clustering algorithms which continue to appear in the literature; each new clustering algorithm performs slightly better than the existing ones on a specific distribution of patterns.

It is essential for the user of a clustering algorithm to not only have a thorough understanding of the particular technique being utilized, but also to know the details of the data gathering process and to have some domain expertise; the more information the user has about the data at hand, the more likely the user would be able to succeed in assessing its true class structure [Jain and Dubes 1988].


This domain information can also be used to improve the quality of feature extraction, similarity computation, grouping, and cluster representation [Murty and Jain 1995].

Appropriate constraints on the data source can be incorporated into a clustering procedure. One example of this is mixture resolving [Titterington et al. 1985], wherein it is assumed that the data are drawn from a mixture of an unknown number of densities (often assumed to be multivariate Gaussian). The clustering problem here is to identify the number of mixture components and the parameters of each component. The concept of density clustering and a methodology for decomposition of feature spaces [Bajcsy 1997] have also been incorporated into traditional clustering methodology, yielding a technique for extracting overlapping clusters.

1.4 History

Even though there is an increasing interest in the use of clustering methods in pattern recognition [Anderberg 1973], image processing [Jain and Flynn 1996] and information retrieval [Rasmussen 1992; Salton 1991], clustering has a rich history in other disciplines [Jain and Dubes 1988] such as biology, psychiatry, psychology, archaeology, geology, geography, and marketing. Other terms more or less synonymous with clustering include unsupervised learning [Jain and Dubes 1988], numerical taxonomy [Sneath and Sokal 1973], vector quantization [Oehler and Gray 1995], and learning by observation [Michalski and Stepp 1983]. The field of spatial analysis of point patterns [Ripley 1988] is also related to cluster analysis. The importance and interdisciplinary nature of clustering is evident through its vast literature.

A number of books on clustering have been published [Jain and Dubes 1988; Anderberg 1973; Hartigan 1975; Spath 1980; Duran and Odell 1974; Everitt 1993; Backer 1995], in addition to some useful and influential review papers. A survey of the state of the art in clustering circa 1978 was reported in Dubes and Jain [1980]. A comparison of various clustering algorithms for constructing the minimal spanning tree and the short spanning path was given in Lee [1981]. Cluster analysis was also surveyed in Jain et al. [1986]. A review of image segmentation by clustering was reported in Jain and Flynn [1996]. Comparisons of various combinatorial optimization schemes, based on experiments, have been reported in Mishra and Raghavan [1994] and Al-Sultan and Khan [1996].

1.5 Outline

This paper is organized as follows. Section 2 presents definitions of terms to be used throughout the paper. Section 3 summarizes pattern representation, feature extraction, and feature selection. Various approaches to the computation of proximity between patterns are discussed in Section 4. Section 5 presents a taxonomy of clustering approaches, describes the major techniques in use, and discusses emerging techniques for clustering incorporating non-numeric constraints and the clustering of large sets of patterns. Section 6 discusses applications of clustering methods to image analysis and data mining problems. Finally, Section 7 presents some concluding remarks.

2. DEFINITIONS AND NOTATION

The following terms and notation are used throughout this paper.

—A pattern (or feature vector, observation, or datum) $\mathbf{x}$ is a single data item used by the clustering algorithm. It typically consists of a vector of $d$ measurements: $\mathbf{x} = (x_1, \ldots, x_d)$.

—The individual scalar components $x_i$ of a pattern $\mathbf{x}$ are called features (or attributes).


—$d$ is the dimensionality of the pattern or of the pattern space.

—A pattern set is denoted $\mathcal{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. The $i$th pattern in $\mathcal{X}$ is denoted $\mathbf{x}_i = (x_{i,1}, \ldots, x_{i,d})$. In many cases a pattern set to be clustered is viewed as an $n \times d$ pattern matrix.

—A class, in the abstract, refers to a state of nature that governs the pattern generation process in some cases. More concretely, a class can be viewed as a source of patterns whose distribution in feature space is governed by a probability density specific to the class. Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern generation processes represented in the pattern set.

—Hard clustering techniques assign a class label $l_i$ to each pattern $\mathbf{x}_i$, identifying its class. The set of all labels for a pattern set $\mathcal{X}$ is $\mathcal{L} = \{l_1, \ldots, l_n\}$, with $l_i \in \{1, \ldots, k\}$, where $k$ is the number of clusters.

—Fuzzy clustering procedures assign to each input pattern $\mathbf{x}_i$ a fractional degree of membership $f_{ij}$ in each output cluster $j$.

—A distance measure (a specialization of a proximity measure) is a metric (or quasi-metric) on the feature space used to quantify the similarity of patterns.

3. PATTERN REPRESENTATION, FEATURE SELECTION AND EXTRACTION

There are no theoretical guidelines that suggest the appropriate patterns and features to use in a specific situation. Indeed, the pattern generation process is often not directly controllable; the user's role in the pattern representation process is to gather facts and conjectures about the data, optionally perform feature selection and extraction, and design the subsequent elements of the clustering system. Because of the difficulties surrounding pattern representation, it is conveniently assumed that the pattern representation is available prior to clustering. Nonetheless, a careful investigation of the available features and any available transformations (even simple ones) can yield significantly improved clustering results. A good pattern representation can often yield a simple and easily understood clustering; a poor pattern representation may yield a complex clustering whose true structure is difficult or impossible to discern.

Figure 3 shows a simple example. The points in this 2D feature space are arranged in a curvilinear cluster of approximately constant distance from the origin. If one chooses Cartesian coordinates to represent the patterns, many clustering algorithms would be likely to fragment the cluster into two or more clusters, since it is not compact. If, however, one uses a polar coordinate representation for the clusters, the radius coordinate exhibits tight clustering and a one-cluster solution is likely to be easily obtained.

A pattern can measure either a physical object (e.g., a chair) or an abstract notion (e.g., a style of writing). As noted above, patterns are represented conventionally as multidimensional vectors, where each dimension is a single feature [Duda and Hart 1973]. These features can be either quantitative or qualitative. For example, if weight and color are the two features used, then (20, black) is the representation of a black object with 20 units of weight. The features can be subdivided into the following types [Gowda and Diday 1992]:

(1) Quantitative features, e.g.,
    (a) continuous values (e.g., weight);
    (b) discrete values (e.g., the number of computers);
    (c) interval values (e.g., the duration of an event).

(2) Qualitative features:
    (a) nominal or unordered (e.g., color);


    (b) ordinal (e.g., military rank or qualitative evaluations of temperature ("cool" or "hot") or sound intensity ("quiet" or "loud")).

Figure 3. A curvilinear cluster whose points are approximately equidistant from the origin. Different pattern representations (coordinate systems) would cause clustering algorithms to yield different results for this data (see text).

Quantitative features can be measured on a ratio scale (with a meaningful reference value, such as temperature), or on nominal or ordinal scales.

One can also use structured features [Michalski and Stepp 1983] which are represented as trees, where the parent node represents a generalization of its child nodes. For example, a parent node "vehicle" may be a generalization of children labeled "cars," "buses," "trucks," and "motorcycles." Further, the node "cars" could be a generalization of cars of the type "Toyota," "Ford," "Benz," etc. A generalized representation of patterns, called symbolic objects, was proposed in Diday [1988]. Symbolic objects are defined by a logical conjunction of events. These events link values and features in which the features can take one or more values and all the objects need not be defined on the same set of features.

It is often valuable to isolate only the most descriptive and discriminatory features in the input set, and utilize those features exclusively in subsequent analysis. Feature selection techniques identify a subset of the existing features for subsequent use, while feature extraction techniques compute new features from the original set. In either case, the goal is to improve classification performance and/or computational efficiency. Feature selection is a well-explored topic in statistical pattern recognition [Duda and Hart 1973]; however, in a clustering context (i.e., lacking class labels for patterns), the feature selection process is of necessity ad hoc, and might involve a trial-and-error process where various subsets of features are selected, the resulting patterns clustered, and the output evaluated using a validity index. In contrast, some of the popular feature extraction processes (e.g., principal components analysis [Fukunaga 1990]) do not depend on labeled data and can be used directly. Reduction of the number of features has an additional benefit, namely the ability to produce output that can be visually inspected by a human.

4. SIMILARITY MEASURES

Since similarity is fundamental to the definition of a cluster, a measure of the similarity between two patterns drawn from the same feature space is essential to most clustering procedures. Because of the variety of feature types and scales, the distance measure (or measures) must be chosen carefully. It is most common to calculate the dissimilarity between two patterns using a distance measure defined on the feature space. We will focus on the well-known distance measures used for patterns whose features are all continuous.

The most popular metric for continuous features is the Euclidean distance

$$d_2(\mathbf{x}_i, \mathbf{x}_j) = \Big( \sum_{k=1}^{d} (x_{i,k} - x_{j,k})^2 \Big)^{1/2} = \| \mathbf{x}_i - \mathbf{x}_j \|_2,$$

which is a special case ($p = 2$) of the Minkowski metric


$$d_p(\mathbf{x}_i, \mathbf{x}_j) = \Big( \sum_{k=1}^{d} |x_{i,k} - x_{j,k}|^p \Big)^{1/p} = \| \mathbf{x}_i - \mathbf{x}_j \|_p .$$

The Euclidean distance has an intuitive appeal as it is commonly used to evaluate the proximity of objects in two or three-dimensional space. It works well when a data set has "compact" or "isolated" clusters [Mao and Jain 1996]. The drawback to direct use of the Minkowski metrics is the tendency of the largest-scaled feature to dominate the others. Solutions to this problem include normalization of the continuous features (to a common range or variance) or other weighting schemes. Linear correlation among features can also distort distance measures; this distortion can be alleviated by applying a whitening transformation to the data or by using the squared Mahalanobis distance

$$d_M(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i - \mathbf{x}_j) S^{-1} (\mathbf{x}_i - \mathbf{x}_j)^T,$$

where the patterns $\mathbf{x}_i$ and $\mathbf{x}_j$ are assumed to be row vectors, and $S$ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process; $d_M(\cdot, \cdot)$ assigns different weights to different features based on their variances and pairwise linear correlations. Here, it is implicitly assumed that class conditional densities are unimodal and characterized by multidimensional spread, i.e., that the densities are multivariate Gaussian. The regularized Mahalanobis distance was used in Mao and Jain [1996] to extract hyperellipsoidal clusters. Recently, several researchers [Huttenlocher et al. 1993; Dubuisson and Jain 1994] have used the Hausdorff distance in a point set matching context.

Some clustering algorithms work on a matrix of proximity values instead of on the original pattern set. It is useful in such situations to precompute all the $n(n-1)/2$ pairwise distance values for the $n$ patterns and store them in a (symmetric) matrix.

Computation of distances between patterns with some or all features being noncontinuous is problematic, since the different types of features are not comparable and (as an extreme example) the notion of proximity is effectively binary-valued for nominal-scaled features. Nonetheless, practitioners (especially those in machine learning, where mixed-type patterns are common) have developed proximity measures for heterogeneous type patterns. A recent example is Wilson and Martinez [1997], which proposes a combination of a modified Minkowski metric for continuous features and a distance based on counts (population) for nominal attributes. A variety of other metrics have been reported in Diday and Simon [1976] and Ichino and Yaguchi [1994] for computing the similarity between patterns represented using quantitative as well as qualitative features.

Patterns can also be represented using string or tree structures [Knuth 1973]. Strings are used in syntactic clustering [Fu and Lu 1977]. Several measures of similarity between strings are described in Baeza-Yates [1992]. A good summary of similarity measures between trees is given by Zhang [1995]. A comparison of syntactic and statistical approaches for pattern recognition using several criteria was presented in Tanaka [1995] and the conclusion was that syntactic methods are inferior in every aspect. Therefore, we do not consider syntactic methods further in this paper.
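As a concrete illustration of the continuous-feature distances defined above (the Minkowski family and the squared Mahalanobis distance), the following sketch uses NumPy; it is illustrative only, and the small pattern matrix is a made-up example.

import numpy as np

def minkowski(xi, xj, p=2):
    # d_p(x_i, x_j) = (sum_k |x_ik - x_jk|^p)^(1/p); p = 2 gives the Euclidean distance.
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

def squared_mahalanobis(xi, xj, S):
    # d_M(x_i, x_j) = (x_i - x_j) S^{-1} (x_i - x_j)^T,
    # with S the (sample) covariance matrix of the patterns.
    diff = xi - xj
    return float(diff @ np.linalg.inv(S) @ diff.T)

X = np.array([[2.0, 3.0], [3.0, 5.0], [6.0, 1.0], [7.0, 2.0]])  # n x d pattern matrix
S = np.cov(X, rowvar=False)            # sample covariance of the features

print(minkowski(X[0], X[1], p=2))      # Euclidean distance
print(minkowski(X[0], X[1], p=1))      # city-block (p = 1) distance
print(squared_mahalanobis(X[0], X[1], S))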


There are some distance measures reported in the literature [Gowda and Krishna 1977; Jarvis and Patrick 1973] that take into account the effect of surrounding or neighboring points. These surrounding points are called context in Michalski and Stepp [1983]. The similarity between two points $\mathbf{x}_i$ and $\mathbf{x}_j$, given this context, is given by

$$s(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i, \mathbf{x}_j, \mathcal{E}),$$

where $\mathcal{E}$ is the context (the set of surrounding points). One metric defined using context is the mutual neighbor distance (MND), proposed in Gowda and Krishna [1977], which is given by

$$MND(\mathbf{x}_i, \mathbf{x}_j) = NN(\mathbf{x}_i, \mathbf{x}_j) + NN(\mathbf{x}_j, \mathbf{x}_i),$$

where $NN(\mathbf{x}_i, \mathbf{x}_j)$ is the neighbor number of $\mathbf{x}_j$ with respect to $\mathbf{x}_i$. Figures 4 and 5 give an example. In Figure 4, the nearest neighbor of A is B, and B's nearest neighbor is A. So, $NN(A, B) = NN(B, A) = 1$ and the MND between A and B is 2. However, $NN(B, C) = 1$ but $NN(C, B) = 2$, and therefore $MND(B, C) = 3$. Figure 5 was obtained from Figure 4 by adding three new points D, E, and F. Now $MND(B, C) = 3$ (as before), but $MND(A, B) = 5$. The MND between A and B has increased by introducing additional points, even though A and B have not moved. The MND is not a metric (it does not satisfy the triangle inequality [Zhang 1995]). In spite of this, MND has been successfully applied in several clustering applications [Gowda and Diday 1992]. This observation supports the viewpoint that the dissimilarity does not need to be a metric.

Figure 4. A and B are more similar than A and C.

Figure 5. After a change in context, B and C are more similar than B and A.

Watanabe's theorem of the ugly duckling [Watanabe 1985] states:

"Insofar as we use a finite set of predicates that are capable of distinguishing any two objects considered, the number of predicates shared by any two such objects is constant, independent of the choice of objects."

This implies that it is possible to make any two arbitrary patterns equally similar by encoding them with a sufficiently large number of features. As a consequence, any two arbitrary patterns are equally similar, unless we use some additional domain information. For example, in the case of conceptual clustering [Michalski and Stepp 1983], the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$ is defined as

$$s(\mathbf{x}_i, \mathbf{x}_j) = f(\mathbf{x}_i, \mathbf{x}_j, \mathcal{C}, \mathcal{E}),$$

where $\mathcal{C}$ is a set of pre-defined concepts. This notion is illustrated with the help of Figure 6. Here, the Euclidean distance between points A and B is less than that between B and C. However, B and C can be viewed as "more similar" than A and B because B and C belong to the same concept (ellipse) and A belongs to a different concept (rectangle). The conceptual similarity measure is the most general similarity measure. We discuss several pragmatic issues associated with its use in Section 5.
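The mutual neighbor distance defined above is straightforward to compute from nearest-neighbor ranks. The sketch below is an illustrative NumPy implementation (not code from the survey); the three points are made-up coordinates loosely mimicking A, B, and C in Figure 4, and Euclidean distance is assumed as the base metric.

import numpy as np

def neighbor_number(X, i, j):
    # NN(x_i, x_j): the rank of x_j among the neighbors of x_i
    # (1 = nearest neighbor), using Euclidean distance.
    d = np.linalg.norm(X - X[i], axis=1)
    order = np.argsort(d)              # order[0] is i itself (distance 0)
    return int(np.where(order == j)[0][0])

def mutual_neighbor_distance(X, i, j):
    # MND(x_i, x_j) = NN(x_i, x_j) + NN(x_j, x_i)
    return neighbor_number(X, i, j) + neighbor_number(X, j, i)

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.5]])   # A, B, C (illustrative)
print(mutual_neighbor_distance(X, 0, 1))   # MND(A, B) = 2
print(mutual_neighbor_distance(X, 1, 2))   # MND(B, C) = 3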


Figure 6. Conceptual similarity between points.

5. CLUSTERING TECHNIQUES

Different approaches to clustering data can be described with the help of the hierarchy shown in Figure 7 (other taxonometric representations of clustering methodology are possible; ours is based on the discussion in Jain and Dubes [1988]). At the top level, there is a distinction between hierarchical and partitional approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce only one).

The taxonomy shown in Figure 7 must be supplemented by a discussion of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy.

—Agglomerative vs. divisive: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct (singleton) cluster, and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met.

—Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm reported in Anderberg [1973] considers features sequentially to divide the given collection of patterns. This is illustrated in Figure 8. Here, the collection is divided into two groups using feature $x_1$; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature $x_2$, as depicted by the broken lines $H_1$ and $H_2$. The major problem with this algorithm is that it generates $2^d$ clusters, where $d$ is the dimensionality of the patterns. For large values of $d$ ($d > 100$ is typical in information retrieval applications [Salton 1991]), the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters.

—Hard vs. fuzzy: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership.

—Deterministic vs. stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.


—Incremental vs. non-incremental: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of data structures used in the algorithm's operations.

A cogent observation in Jain and Dubes [1988] is that the specification of an algorithm for clustering usually leaves considerable flexibility in implementation.

Figure 7. A taxonomy of clustering approaches: at the top level, hierarchical methods (single link, complete link) are distinguished from partitional methods (square error, e.g. k-means; graph theoretic; mixture resolving, e.g. expectation maximization; and mode seeking).

Figure 8. Monothetic partitional clustering.

5.1 Hierarchical Clustering Algorithms

The operation of a hierarchical clustering algorithm is illustrated using the two-dimensional data set in Figure 9. This figure depicts seven patterns labeled A, B, C, D, E, F, and G in three clusters. A hierarchical algorithm yields a dendrogram representing the nested grouping of patterns and similarity levels at which groupings change. A dendrogram corresponding to the seven points in Figure 9 (obtained from the single-link algorithm [Jain and Dubes 1988]) is shown in Figure 10. The dendrogram can be broken at different levels to yield different clusterings of the data.

Most hierarchical clustering algorithms are variants of the single-link [Sneath and Sokal 1973], complete-link [King 1967], and minimum-variance [Ward 1963; Murtagh 1984] algorithms. Of these, the single-link and complete-link algorithms are most popular. These two algorithms differ in the way they characterize the similarity between a pair of clusters.


Figure 9. Points falling in three clusters (clusters 1, 2, and 3 containing the patterns A-G).

Figure 10. The dendrogram obtained using the single-link algorithm.

Figure 11. Two concentric clusters.

In the single-link method, the distance between two clusters is the minimum of the distances between all pairs of patterns drawn from the two clusters (one pattern from the first cluster, the other from the second). In the complete-link algorithm, the distance between two clusters is the maximum of all pairwise distances between patterns in the two clusters. In either case, two clusters are merged to form a larger cluster based on minimum distance criteria. The complete-link algorithm produces tightly bound or compact clusters [Baeza-Yates 1992]. The single-link algorithm, by contrast, suffers from a chaining effect [Nagy 1968]. It has a tendency to produce clusters that are straggly or elongated. There are two clusters in Figures 12 and 13 separated by a "bridge" of noisy patterns. The single-link algorithm produces the clusters shown in Figure 12, whereas the complete-link algorithm obtains the clustering shown in Figure 13. The clusters obtained by the complete-link algorithm are more compact than those obtained by the single-link algorithm; the cluster labeled 1 obtained using the single-link algorithm is elongated because of the noisy patterns labeled "*". The single-link algorithm is more versatile than the complete-link algorithm, otherwise. For example, the single-link algorithm can extract the concentric clusters shown in Figure 11, but the complete-link algorithm cannot. However, from a pragmatic viewpoint, it has been observed that the complete-link algorithm produces more useful hierarchies in many applications than the single-link algorithm [Jain and Dubes 1988].

Agglomerative Single-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value $d_k$ a graph on the patterns where pairs of patterns closer than $d_k$ are connected by a graph edge. If all the patterns are members of a connected graph, stop. Otherwise, repeat this step.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level, forming a partition (clustering) identified by simply connected components in the corresponding graph.
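A compact sketch of the graph-based single-link procedure above, assuming NumPy is available: pairs of patterns are processed in order of increasing distance, and connected components are tracked with a union-find structure, so cutting the hierarchy at a dissimilarity level amounts to stopping once the next distance exceeds that level. This is an illustration, not code from the survey, and the data and cut level are made-up.

import numpy as np

def single_link_partition(X, cut_level):
    # Step 1: all distinct unordered pairs, sorted by interpattern distance.
    n = len(X)
    pairs = [(np.linalg.norm(X[i] - X[j]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    pairs.sort()
    # Union-find bookkeeping for connected components of the threshold graph.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    # Step 2: add edges in order of distance; stop at the desired cut level.
    for d, i, j in pairs:
        if d > cut_level:
            break
        parent[find(i)] = find(j)
    # Step 3: the connected components form the partition at this level.
    roots = {find(i) for i in range(n)}
    relabel = {r: c for c, r in enumerate(sorted(roots))}
    return [relabel[find(i)] for i in range(n)]

X = np.array([[0.0, 0.0], [0.3, 0.0], [0.6, 0.1],    # a chain-like cluster
              [5.0, 5.0], [5.4, 5.1]])
print(single_link_partition(X, cut_level=1.0))        # e.g., [0, 0, 0, 1, 1]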


Figure 12. A single-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

Figure 13. A complete-link clustering of a pattern set containing two classes (1 and 2) connected by a chain of noisy patterns (*).

Agglomerative Complete-Link Clustering Algorithm

(1) Place each pattern in its own cluster. Construct a list of interpattern distances for all distinct unordered pairs of patterns, and sort this list in ascending order.

(2) Step through the sorted list of distances, forming for each distinct dissimilarity value $d_k$ a graph on the patterns where pairs of patterns closer than $d_k$ are connected by a graph edge. If all the patterns are members of a completely connected graph, stop.

(3) The output of the algorithm is a nested hierarchy of graphs which can be cut at a desired dissimilarity level, forming a partition (clustering) identified by completely connected components in the corresponding graph.

Hierarchical algorithms are more versatile than partitional algorithms. For example, the single-link clustering algorithm works well on data sets containing non-isotropic clusters, including well-separated, chain-like, and concentric clusters, whereas a typical partitional algorithm such as the k-means algorithm works well only on data sets having isotropic clusters [Nagy 1968]. On the other hand, the time and space complexities [Day 1992] of the partitional algorithms are typically lower than those of the hierarchical algorithms. It is possible to develop hybrid algorithms [Murty and Krishna 1980] that exploit the good features of both categories.

Hierarchical Agglomerative Clustering Algorithm

(1) Compute the proximity matrix containing the distance between each pair of patterns. Treat each pattern as a cluster.

(2) Find the most similar pair of clusters using the proximity matrix. Merge these two clusters into one cluster. Update the proximity matrix to reflect this merge operation.

(3) If all patterns are in one cluster, stop. Otherwise, go to step 2.

Based on the way the proximity matrix is updated in step 2, a variety of agglomerative algorithms can be designed. Hierarchical divisive algorithms start with a single cluster of all the given objects and keep splitting the clusters based on some criterion to obtain a partition of singleton clusters.
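The generic agglomerative procedure above can be sketched directly; the linkage rule supplied as an argument (min for single link, max for complete link) determines which variant is obtained. This is an illustrative NumPy implementation, not code from the survey: rather than updating a reduced proximity matrix, it recomputes the linkage from the pattern-level distances at each merge, which is equivalent for the single- and complete-link rules, and it records each merge (the pair of clusters and the linkage distance), which is the information a dendrogram encodes.

import numpy as np

def agglomerative(D, linkage=min):
    # D: symmetric (n x n) proximity matrix; linkage: min (single link)
    # or max (complete link), applied to inter-cluster pattern distances.
    clusters = [[i] for i in range(D.shape[0])]
    merges = []                       # (cluster_a, cluster_b, distance) history
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters under the linkage rule.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return merges

# Toy example: two tight pairs and one outlier (made-up data).
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [9.0, 0.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
for merge in agglomerative(D, linkage=min):       # single-link variant
    print(merge)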


5.2 Partitional Algorithms

A partitional clustering algorithm obtains a single partition of the data instead of a clustering structure, such as the dendrogram produced by a hierarchical technique. Partitional methods have advantages in applications involving large data sets for which the construction of a dendrogram is computationally prohibitive. A problem accompanying the use of a partitional algorithm is the choice of the number of desired output clusters. A seminal paper [Dubes 1987] provides guidance on this key design decision. The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (defined over all of the patterns). Combinatorial search of the set of possible labelings for an optimum value of a criterion is clearly computationally prohibitive. In practice, therefore, the algorithm is typically run multiple times with different starting states, and the best configuration obtained from all of the runs is used as the output clustering.

5.2.1 Squared Error Algorithms. The most intuitive and frequently used criterion function in partitional clustering techniques is the squared error criterion, which tends to work well with isolated and compact clusters. The squared error for a clustering $\mathcal{L}$ of a pattern set $\mathcal{X}$ (containing $K$ clusters) is

$$e^2(\mathcal{X}, \mathcal{L}) = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \| \mathbf{x}_i^{(j)} - \mathbf{c}_j \|^2,$$

where $\mathbf{x}_i^{(j)}$ is the $i$th pattern belonging to the $j$th cluster and $\mathbf{c}_j$ is the centroid of the $j$th cluster.

The k-means is the simplest and most commonly used algorithm employing a squared error criterion [McQueen 1967]. It starts with a random initial partition and keeps reassigning the patterns to clusters based on the similarity between the pattern and the cluster centers until a convergence criterion is met (e.g., there is no reassignment of any pattern from one cluster to another, or the squared error ceases to decrease significantly after some number of iterations). The k-means algorithm is popular because it is easy to implement, and its time complexity is $O(n)$, where $n$ is the number of patterns. A major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local minimum of the criterion function value if the initial partition is not properly chosen. Figure 14 shows seven two-dimensional patterns. If we start with patterns A, B, and C as the initial means around which the three clusters are built, then we end up with the partition {{A}, {B, C}, {D, E, F, G}} shown by ellipses. The squared error criterion value is much larger for this partition than for the best partition {{A, B, C}, {D, E}, {F, G}} shown by rectangles, which yields the global minimum value of the squared error criterion function for a clustering containing three clusters. The correct three-cluster solution is obtained by choosing, for example, A, D, and F as the initial cluster means.

Squared Error Clustering Method

(1) Select an initial partition of the patterns with a fixed number of clusters and cluster centers.

(2) Assign each pattern to its closest cluster center and compute the new cluster centers as the centroids of the clusters. Repeat this step until convergence is achieved, i.e., until the cluster membership is stable.

(3) Merge and split clusters based on some heuristic information, optionally repeating step 2.

k-Means Clustering Algorithm

(1) Choose $k$ cluster centers to coincide with $k$ randomly-chosen patterns or $k$ randomly defined points inside the hypervolume containing the pattern set.

(2) Assign each pattern to the closest cluster center.

(3) Recompute the cluster centers using the current cluster memberships.

(4) If a convergence criterion is not met, go to step 2. Typical convergence criteria are: no (or minimal) reassignment of patterns to new cluster centers, or minimal decrease in squared error.
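A minimal NumPy sketch of the k-means iteration just listed (illustrative only; the data, the iteration cap, and the random seed are made-up, and a practical implementation would add multiple restarts, since, as noted above, the result depends strongly on the initial centers):

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k cluster centers to coincide with k randomly chosen patterns.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each pattern to the closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the cluster centers from the current memberships.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 4: stop when no center moves (no further reassignment will follow).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 5.3], [9.0, 1.0], [8.8, 1.2]])
labels, centers = k_means(X, k=3)
print(labels)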


Figure 14. The k-means algorithm is sensitive to the initial partition.

Several variants [Anderberg 1973] of the k-means algorithm have been reported in the literature. Some of them attempt to select a good initial partition so that the algorithm is more likely to find the global minimum value.

Another variation is to permit splitting and merging of the resulting clusters. Typically, a cluster is split when its variance is above a pre-specified threshold, and two clusters are merged when the distance between their centroids is below another pre-specified threshold. Using this variant, it is possible to obtain the optimal partition starting from any arbitrary initial partition, provided proper threshold values are specified. The well-known ISODATA [Ball and Hall 1965] algorithm employs this technique of merging and splitting clusters. If ISODATA is given the "ellipse" partitioning shown in Figure 14 as an initial partitioning, it will produce the optimal three-cluster partitioning. ISODATA will first merge the clusters {A} and {B,C} into one cluster because the distance between their centroids is small and then split the cluster {D,E,F,G}, which has a large variance, into two clusters {D,E} and {F,G}.

Another variation of the k-means algorithm involves selecting a different criterion function altogether. The dynamic clustering algorithm (which permits representations other than the centroid for each cluster) was proposed in Diday [1973], and Symon [1977] describes a dynamic clustering approach obtained by formulating the clustering problem in the framework of maximum-likelihood estimation. The regularized Mahalanobis distance was used in Mao and Jain [1996] to obtain hyperellipsoidal clusters.

5.2.2 Graph-Theoretic Clustering. The best-known graph-theoretic divisive clustering algorithm is based on construction of the minimal spanning tree (MST) of the data [Zahn 1971], and then deleting the MST edges with the largest lengths to generate clusters. Figure 15 depicts the MST obtained from nine two-dimensional points. By breaking the link labeled CD with a length of 6 units (the edge with the maximum Euclidean length), two clusters ({A, B, C} and {D, E, F, G, H, I}) are obtained. The second cluster can be further divided into two clusters by breaking the edge EF, which has a length of 4.5 units.

The hierarchical approaches are also related to graph-theoretic clustering. Single-link clusters are subgraphs of the minimum spanning tree of the data [Gower and Ross 1969] which are also the connected components [Gotlieb and Kumar 1968]. Complete-link clusters are maximal complete subgraphs, and are related to the node colorability of graphs [Backer and Hubert 1976]. The maximal complete subgraph was considered the strictest definition of a cluster in Augustson and Minker [1970] and Raghavan and Yu [1981].


A graph-oriented approach for non-hierarchical structures and overlapping clusters is presented in Ozawa [1985]. The Delaunay graph (DG) is obtained by connecting all the pairs of points that are Voronoi neighbors. The DG contains all the neighborhood information contained in the MST and the relative neighborhood graph (RNG) [Toussaint 1980].

Figure 15. Using the minimal spanning tree to form clusters; the edge with the maximum length (CD, 6 units) is broken first, and the edge EF (4.5 units) can be broken next.

5.3 Mixture-Resolving and Mode-Seeking Algorithms

The mixture resolving approach to cluster analysis has been addressed in a number of ways. The underlying assumption is that the patterns to be clustered are drawn from one of several distributions, and the goal is to identify the parameters of each and (perhaps) their number. Most of the work in this area has assumed that the individual components of the mixture density are Gaussian, and in this case the parameters of the individual Gaussians are to be estimated by the procedure. Traditional approaches to this problem involve obtaining (iteratively) a maximum likelihood estimate of the parameter vectors of the component densities [Jain and Dubes 1988].

More recently, the Expectation Maximization (EM) algorithm (a general-purpose maximum likelihood algorithm [Dempster et al. 1977] for missing-data problems) has been applied to the problem of parameter estimation. A recent book [Mitchell 1997] provides an accessible description of the technique. In the EM framework, the parameters of the component densities are unknown, as are the mixing parameters, and these are estimated from the patterns. The EM procedure begins with an initial estimate of the parameter vector and iteratively rescores the patterns against the mixture density produced by the parameter vector. The rescored patterns are then used to update the parameter estimates. In a clustering context, the scores of the patterns (which essentially measure their likelihood of being drawn from particular components of the mixture) can be viewed as hints at the class of the pattern. Those patterns, placed (by their scores) in a particular component, would therefore be viewed as belonging to the same cluster.

Nonparametric techniques for density-based clustering have also been developed [Jain and Dubes 1988]. Inspired by the Parzen window approach to nonparametric density estimation, the corresponding clustering procedure searches for bins with large counts in a multidimensional histogram of the input pattern set. Other approaches include the application of another partitional or hierarchical clustering algorithm using a distance measure based on a nonparametric density estimate.

5.4 Nearest Neighbor Clustering

Since proximity plays a key role in our intuitive notion of a cluster, nearest-neighbor distances can serve as the basis of clustering procedures. An iterative procedure was proposed in Lu and Fu [1978]; it assigns each unlabeled pattern to the cluster of its nearest labeled neighbor pattern, provided the distance to that labeled neighbor is below a threshold. The process continues until all patterns are labeled or no additional labelings occur. The mutual neighborhood value (described earlier in the context of distance computation) can also be used to grow clusters from near neighbors.
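The nearest-neighbor growing procedure attributed above to Lu and Fu [1978] can be sketched as follows; this is an illustrative NumPy version under the simplifying assumption that a few patterns have already been given seed labels, and the data, seeds, and threshold are made-up.

import numpy as np

def nn_clustering(X, seed_labels, threshold):
    # seed_labels: array with a cluster id for a few seed patterns, -1 elsewhere.
    labels = seed_labels.copy()
    changed = True
    while changed:                      # continue until no additional labelings occur
        changed = False
        for i in np.where(labels < 0)[0]:
            labeled = np.where(labels >= 0)[0]
            d = np.linalg.norm(X[labeled] - X[i], axis=1)
            j = labeled[d.argmin()]
            # Assign the pattern to the cluster of its nearest labeled neighbor,
            # provided the distance to that neighbor is below the threshold.
            if np.linalg.norm(X[j] - X[i]) < threshold:
                labels[i] = labels[j]
                changed = True
    return labels

X = np.array([[0.0, 0.0], [0.5, 0.1], [1.0, 0.0], [6.0, 6.0], [6.4, 6.2], [3.0, 3.0]])
seeds = np.array([0, -1, -1, 1, -1, -1])    # pattern 0 seeds cluster 0, pattern 3 seeds cluster 1
print(nn_clustering(X, seeds, threshold=1.0))   # the isolated pattern may remain unlabeled (-1)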


5.5 Fuzzy Clustering

Traditional clustering approaches generate partitions; in a partition, each pattern belongs to one and only one cluster. Hence, the clusters in a hard clustering are disjoint. Fuzzy clustering extends this notion to associate each pattern with every cluster using a membership function [Zadeh 1965]. The output of such algorithms is a clustering, but not a partition. We give a high-level partitional fuzzy clustering algorithm below.

Fuzzy Clustering Algorithm

(1) Select an initial fuzzy partition of the N objects into K clusters by selecting the N × K membership matrix U. An element u_ij of this matrix represents the grade of membership of object x_i in cluster c_j. Typically, u_ij ∈ [0,1].

(2) Using U, find the value of a fuzzy criterion function, e.g., a weighted squared error criterion function, associated with the corresponding partition. One possible fuzzy criterion function is

    E^2(X, U) = Σ_{i=1}^{N} Σ_{k=1}^{K} u_ik ||x_i − c_k||^2,

where c_k = Σ_{i=1}^{N} u_ik x_i is the kth fuzzy cluster center. Reassign patterns to clusters to reduce this criterion function value and recompute U.

(3) Repeat step 2 until entries in U do not change significantly.

In fuzzy clustering, each cluster is a fuzzy set of all the patterns. Figure 16 illustrates the idea. The rectangles enclose two "hard" clusters in the data: H1 = {1,2,3,4,5} and H2 = {6,7,8,9}. A fuzzy clustering algorithm might produce the two fuzzy clusters F1 and F2 depicted by ellipses. The patterns will have membership values in [0,1] for each cluster. For example, fuzzy cluster F1 could be compactly described as

    {(1,0.9), (2,0.8), (3,0.7), (4,0.6), (5,0.55), (6,0.2), (7,0.2), (8,0.0), (9,0.0)}

and F2 could be described as

    {(1,0.0), (2,0.0), (3,0.0), (4,0.1), (5,0.15), (6,0.4), (7,0.35), (8,1.0), (9,0.9)}

The ordered pairs (i, μ_i) in each cluster represent the ith pattern and its membership value μ_i in the cluster. Larger membership values indicate higher confidence in the assignment of the pattern to the cluster. A hard clustering can be obtained from a fuzzy partition by thresholding the membership value.

Figure 16. Fuzzy clusters.

Fuzzy set theory was initially applied to clustering in Ruspini [1969]. The book by Bezdek [1981] is a good source for material on fuzzy clustering. The most popular fuzzy clustering algorithm is the fuzzy c-means (FCM) algorithm. Even though it is better than the hard k-means algorithm at avoiding local minima, FCM can still converge to local minima of the squared error criterion. The design of membership functions is the most important problem in fuzzy clustering; different choices include those based on similarity decomposition and centroids of clusters. A generalization of the FCM algorithm was proposed by Bezdek [1981] through a family of objective functions. A fuzzy c-shell algorithm and an adaptive variant for detecting circular and elliptical boundaries was presented in Dave [1992].
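A minimal sketch of the fuzzy c-means iteration is given below (Python). The fuzzifier m = 2, the random initialization of U, and the convergence tolerance are illustrative assumptions for this sketch, not prescriptions from the text.

    import numpy as np

    def fuzzy_c_means(X, K, m=2.0, max_iter=100, tol=1e-5, seed=0):
        """Minimal fuzzy c-means: X is an (N, d) pattern matrix, K the
        number of clusters, m > 1 the fuzzifier.  Returns the membership
        matrix U (N x K) and the cluster centers (K x d)."""
        rng = np.random.default_rng(seed)
        N = X.shape[0]
        U = rng.random((N, K))
        U /= U.sum(axis=1, keepdims=True)              # rows of U sum to one
        for _ in range(max_iter):
            W = U ** m
            centers = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted cluster centers
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
            U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
            U_new /= U_new.sum(axis=1, keepdims=True)       # update memberships
            if np.abs(U_new - U).max() < tol:               # stop when U stabilizes
                U = U_new
                break
            U = U_new
        return U, centers

    # A hard clustering can be recovered by thresholding the memberships,
    # or simply by taking the cluster of maximum membership:
    # hard_labels = U.argmax(axis=1)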

Figure 17. Representation of a cluster by points: by the centroid, and by three distant points.

5.6 Representation of Clusters

In applications where the number of classes or clusters in a data set must be discovered, a partition of the data set is the end product. Here, a partition gives an idea about the separability of the data points into clusters and whether it is meaningful to employ a supervised classifier that assumes a given number of classes in the data set. However, in many other applications that involve decision making, the resulting clusters have to be represented or described in a compact form to achieve data abstraction. Even though the construction of a cluster representation is an important step in decision making, it has not been examined closely by researchers. The notion of cluster representation was introduced in Duran and Odell [1974] and was subsequently studied in Diday and Simon [1976] and Michalski et al. [1981]. They suggested the following representation schemes:

(1) Represent a cluster of points by their centroid or by a set of distant points in the cluster. Figure 17 depicts these two ideas.

(2) Represent clusters using nodes in a classification tree. This is illustrated in Figure 18.

(3) Represent clusters by using conjunctive logical expressions. For example, the expression [X1 > 3][X2 < 2] in Figure 18 stands for the logical statement 'X1 is greater than 3' and 'X2 is less than 2'.

Use of the centroid to represent a cluster is the most popular scheme. It works well when the clusters are compact or isotropic. However, when the clusters are elongated or non-isotropic, then this scheme fails to represent them properly. In such a case, the use of a collection of boundary points in a cluster captures its shape well. The number of points used to represent a cluster should increase as the complexity of its shape increases. The two different representations illustrated in Figure 18 are equivalent. Every path in a classification tree from the root node to a leaf node corresponds to a conjunctive statement. An important limitation of the typical use of the simple conjunctive concept representations is that they can describe only rectangular or isotropic clusters in the feature space.

Figure 18. Representation of clusters by a classification tree or by conjunctive statements (1: [X1 < 3]; 2: [X1 > 3][X2 < 2]; 3: [X1 > 3][X2 > 2]).

Data abstraction is useful in decision making because of the following:

(1) It gives a simple and intuitive description of clusters which is easy for human comprehension. In both conceptual clustering [Michalski and Stepp 1983] and symbolic clustering [Gowda and Diday 1992] this representation is obtained without using an additional step. These algorithms generate the clusters as well as their descriptions. A set of fuzzy rules can be obtained from fuzzy clusters of a data set. These rules can be used to build fuzzy classifiers and fuzzy controllers.

(2) It helps in achieving data compression that can be exploited further by a computer [Murty and Krishna 1980]. Figure 19(a) shows samples belonging to two chain-like clusters labeled 1 and 2. A partitional clustering like the k-means algorithm cannot separate these two structures properly. The single-link algorithm works well on this data, but is computationally expensive. So a hybrid approach may be used to exploit the desirable properties of both these algorithms. We obtain 8 subclusters of the data using the (computationally efficient) k-means algorithm. Each of these subclusters can be represented by their centroids as shown in Figure 19(a). Now the single-link algorithm can be applied on these centroids alone to cluster them into 2 groups. The resulting groups are shown in Figure 19(b). Here, a data reduction is achieved by representing the subclusters by their centroids (a sketch of this hybrid scheme is given after this list).

(3) It increases the efficiency of the decision making task. In a cluster-based document retrieval technique [Salton 1991], a large collection of documents is clustered and each of the clusters is represented using its centroid. In order to retrieve documents relevant to a query, the query is matched with the cluster centroids rather than with all the documents. This helps in retrieving relevant documents efficiently. Also in several applications involving large data sets, clustering is used to perform indexing, which helps in efficient decision making [Dorai and Jain 1995].

Figure 19. Data compression by clustering.
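The hybrid scheme outlined in item (2) can be sketched compactly. The sketch below (Python) relies on SciPy's k-means and single-link routines purely for illustration; the choice of 8 subclusters and 2 final groups follows the example in the text, and the code is not taken from any of the cited works.

    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.cluster.hierarchy import linkage, fcluster

    def hybrid_cluster(X, n_subclusters=8, n_clusters=2):
        """Data compression by clustering: summarize X with k-means
        subcluster centroids, then run the (expensive) single-link
        algorithm on the centroids only."""
        # Level 1: cheap partitional clustering of all patterns.
        centroids, sub_labels = kmeans2(X, n_subclusters, minit='points')
        # Level 2: single-link hierarchical clustering of the centroids alone.
        Z = linkage(centroids, method='single')
        centroid_groups = fcluster(Z, t=n_clusters, criterion='maxclust')
        # Relabel every original pattern with the group of its subcluster centroid.
        return centroid_groups[sub_labels]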

5.7 Artificial Neural Networks for Clustering

Artificial neural networks (ANNs) [Hertz et al. 1991] are motivated by biological neural networks. ANNs have been used extensively over the past three decades for both classification and clustering [Sethi and Jain 1991; Jain and Mao 1994]. Some of the features of the ANNs that are important in pattern clustering are:

(1) ANNs process numerical vectors and so require patterns to be represented using quantitative features only.

(2) ANNs are inherently parallel and distributed processing architectures.

(3) ANNs may learn their interconnection weights adaptively [Jain and Mao 1996; Oja 1982]. More specifically, they can act as pattern normalizers and feature selectors by appropriate selection of weights.

Competitive (or winner-take-all) neural networks [Jain and Mao 1996] are often used to cluster input data. In competitive learning, similar patterns are grouped by the network and represented by a single unit (neuron). This grouping is done automatically based on data correlations. Well-known examples of ANNs used for clustering include Kohonen's learning vector quantization (LVQ) and self-organizing map (SOM) [Kohonen 1984], and adaptive resonance theory models [Carpenter and Grossberg 1990]. The architectures of these ANNs are simple: they are single-layered. Patterns are presented at the input and are associated with the output nodes. The weights between the input nodes and the output nodes are iteratively changed (this is called learning) until a termination criterion is satisfied. Competitive learning has been found to exist in biological neural networks. However, the learning or weight update procedures are quite similar to those in some classical clustering approaches. For example, the relationship between the k-means algorithm and LVQ is addressed in Pal et al. [1993]. The learning algorithm in ART models is similar to the leader clustering algorithm [Moor 1988].

The SOM gives an intuitively appealing two-dimensional map of the multidimensional data set, and it has been successfully used for vector quantization and speech recognition [Kohonen 1984]. However, like its sequential counterpart, the SOM generates a suboptimal partition if the initial weights are not chosen properly. Further, its convergence is controlled by various parameters such as the learning rate and a neighborhood of the winning node in which learning takes place. It is possible that a particular input pattern can fire different output units at different iterations; this brings up the stability issue of learning systems. The system is said to be stable if no pattern in the training data changes its category after a finite number of learning iterations. This problem is closely associated with the problem of plasticity, which is the ability of the algorithm to adapt to new data. For stability, the learning rate should be decreased to zero as iterations progress, and this affects the plasticity. The ART models are supposed to be stable and plastic [Carpenter and Grossberg 1990]. However, ART nets are order-dependent; that is, different partitions are obtained for different orders in which the data is presented to the net. Also, the size and number of clusters generated by an ART net depend on the value chosen for the vigilance threshold, which is used to decide whether a pattern is to be assigned to one of the existing clusters or start a new cluster. Further, both SOM and ART are suitable for detecting only hyperspherical clusters [Hertz et al. 1991]. A two-layer network that employs regularized Mahalanobis distance to extract hyperellipsoidal clusters was proposed in Mao and Jain [1994]. All these ANNs use a fixed number of output nodes, which limits the number of clusters that can be produced.
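A minimal sketch of competitive (winner-take-all) learning for clustering is given below (Python). The random initialization and the particular learning-rate schedule are illustrative assumptions; the decaying rate reflects the stability consideration discussed above.

    import numpy as np

    def competitive_learning(X, K, epochs=20, eta0=0.5, seed=0):
        """Winner-take-all clustering: K output nodes with weight vectors;
        each input pattern moves only the weights of its nearest (winning)
        node toward itself.  The learning rate decays toward zero."""
        rng = np.random.default_rng(seed)
        W = X[rng.choice(len(X), K, replace=False)].astype(float)   # initial weights
        t = 0
        for epoch in range(epochs):
            for x in X[rng.permutation(len(X))]:
                eta = eta0 / (1.0 + t)                # decaying learning rate
                winner = np.argmin(np.linalg.norm(W - x, axis=1))
                W[winner] += eta * (x - W[winner])    # move the winning node toward x
                t += 1
        labels = np.argmin(np.linalg.norm(X[:, None] - W[None, :], axis=2), axis=1)
        return labels, W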

5.8 Evolutionary Approaches for Clustering

Evolutionary approaches, motivated by natural evolution, make use of evolutionary operators and a population of solutions to obtain the globally optimal partition of the data. Candidate solutions to the clustering problem are encoded as chromosomes. The most commonly used evolutionary operators are: selection, recombination, and mutation. Each transforms one or more input chromosomes into one or more output chromosomes. A fitness function evaluated on a chromosome determines a chromosome's likelihood of surviving into the next generation. We give below a high-level description of an evolutionary algorithm applied to clustering.

An Evolutionary Algorithm for Clustering

(1) Choose a random population of solutions. Each solution here corresponds to a valid k-partition of the data. Associate a fitness value with each solution. Typically, fitness is inversely proportional to the squared error value. A solution with a small squared error will have a larger fitness value.

(2) Use the evolutionary operators selection, recombination and mutation to generate the next population of solutions. Evaluate the fitness values of these solutions.

(3) Repeat step 2 until some termination condition is satisfied.

The best-known evolutionary techniques are genetic algorithms (GAs) [Holland 1975; Goldberg 1989], evolution strategies (ESs) [Schwefel 1981], and evolutionary programming (EP) [Fogel et al. 1965]. Out of these three approaches, GAs have been most frequently used in clustering. Typically, solutions are binary strings in GAs. In GAs, a selection operator propagates solutions from the current generation to the next generation based on their fitness. Selection employs a probabilistic scheme so that solutions with higher fitness have a higher probability of getting reproduced.

There are a variety of recombination operators in use; crossover is the most popular. Crossover takes as input a pair of chromosomes (called parents) and outputs a new pair of chromosomes (called children or offspring) as depicted in Figure 20. In Figure 20, a single-point crossover operation is depicted. It exchanges the segments of the parents across a crossover point. For example, in Figure 20, the parents are the binary strings '10110101' and '11001110'. The segments in the two parents after the crossover point (between the fourth and fifth locations) are exchanged to produce the child chromosomes. Mutation takes as input a chromosome and outputs a chromosome by complementing the bit value at a randomly selected location in the input chromosome. For example, the string '11111110' is generated by applying the mutation operator to the second bit location in the string '10111110' (starting at the left). Both crossover and mutation are applied with some prespecified probabilities which depend on the fitness values.

Figure 20. Crossover operation (single-point crossover of parents 10110101 and 11001110 producing children 10111110 and 11000101).
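The operators described above are easy to express in code. The sketch below (Python) is only an illustration: the label-string encoding of a 2-partition and the fitness definition 1/(1 + squared error) are common choices assumed here, not the only ones used in the literature.

    import numpy as np

    def crossover(parent1, parent2, point):
        """Single-point crossover: exchange the segments after `point`."""
        return (parent1[:point] + parent2[point:],
                parent2[:point] + parent1[point:])

    def mutate(chrom, pos):
        """Complement the bit at a chosen position `pos`."""
        flipped = '1' if chrom[pos] == '0' else '0'
        return chrom[:pos] + flipped + chrom[pos + 1:]

    def fitness(chrom, X):
        """Fitness inversely related to the squared error of the
        2-partition encoded by the binary label string `chrom`."""
        labels = np.array([int(c) for c in chrom])
        sse = 0.0
        for k in (0, 1):
            members = X[labels == k]
            if len(members):
                sse += ((members - members.mean(axis=0)) ** 2).sum()
        return 1.0 / (1.0 + sse)

    # The strings used in the text:
    p1, p2 = '10110101', '11001110'
    c1, c2 = crossover(p1, p2, 4)       # children '10111110' and '11000101'
    m = mutate('10111110', 1)           # '11111110'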

GAs represent points in the search space as binary strings, and rely on the crossover operator to explore the search space. Mutation is used in GAs for the sake of completeness, that is, to make sure that no part of the search space is left unexplored. ESs and EP differ from the GAs in solution representation and type of the mutation operator used; EP does not use a recombination operator, but only selection and mutation. Each of these three approaches has been used to solve the clustering problem by viewing it as a minimization of the squared error criterion. Some of the theoretical issues, such as the convergence of these approaches, were studied in Fogel and Fogel [1994].

GAs perform a globalized search for solutions whereas most other clustering procedures perform a localized search. In a localized search, the solution obtained at the 'next iteration' of the procedure is in the vicinity of the current solution. In this sense, the k-means algorithm, fuzzy clustering algorithms, ANNs used for clustering, various annealing schemes (see below), and tabu search are all localized search techniques. In the case of GAs, the crossover and mutation operators can produce new solutions that are completely different from the current ones. We illustrate this fact in Figure 21. Let us assume that the scalar X is coded using a 5-bit binary representation, and let S1 and S2 be two points in the one-dimensional search space. The decimal values of S1 and S2 are 8 and 31, respectively. Their binary representations are S1 = 01000 and S2 = 11111. Let us apply the single-point crossover to these strings, with the crossover site falling between the second and third most significant bits as shown below.

    01|000
    11|111

This will produce a new pair of points or chromosomes S3 and S4 as shown in Figure 21. Here, S3 = 01111 and S4 = 11000. The corresponding decimal values are 15 and 24, respectively. Similarly, by mutating the most significant bit in the binary string 01111 (decimal 15), the binary string 11111 (decimal 31) is generated. These jumps, or gaps between points in successive generations, are much larger than those produced by other approaches.

Figure 21. GAs perform globalized search.

Perhaps the earliest paper on the use of GAs for clustering is by Raghavan and Birchand [1979], where a GA was used to minimize the squared error of a clustering. Here, each point or chromosome represents a partition of N objects into K clusters and is represented by a K-ary string of length N. For example, consider six patterns (A, B, C, D, E, and F) and the string 101001. This six-bit binary (K = 2) string corresponds to placing the six patterns into two clusters. This string represents a two-partition, where one cluster has the first, third, and sixth patterns and the second cluster has the remaining patterns. In other words, the two clusters are {A,C,F} and {B,D,E} (the six-bit binary string 010110 represents the same clustering of the six patterns). When there are K clusters, there are K! different chromosomes corresponding to each K-partition of the data. This increases the effective search space size by a factor of K!.

Further, if crossover is applied on two good chromosomes, the resulting offspring may be inferior in this representation. For example, let {A,B,C} and {D,E,F} be the clusters in the optimal 2-partition of the six patterns considered above. The corresponding chromosomes are 111000 and 000111. By applying single-point crossover at the location between the third and fourth bit positions on these two strings, we get 111111 and 000000 as offspring, and both correspond to an inferior partition. These problems have motivated researchers to design better representation schemes and crossover operators.

In Bhuyan et al. [1991], an improved representation scheme is proposed where an additional separator symbol is used along with the pattern labels to represent a partition. Let the separator symbol be represented by *. Then the chromosome ACF*BDE corresponds to the 2-partition {A,C,F} and {B,D,E}. Using this representation permits them to map the clustering problem into a permutation problem such as the traveling salesman problem, which can be solved by using the permutation crossover operators [Goldberg 1989]. This solution also suffers from permutation redundancy; there are 72 equivalent chromosomes (permutations) corresponding to the same partition of the data into the two clusters {A,C,F} and {B,D,E}.

More recently, Jones and Beltramo [1991] investigated the use of edge-based crossover [Whitley et al. 1989] to solve the clustering problem. Here, all patterns in a cluster are assumed to form a complete graph by connecting them with edges. Offspring are generated from the parents so that they inherit the edges from their parents. It is observed that this crossover operator takes O(K^6 + N) time for N patterns and K clusters, ruling out its applicability on practical data sets having more than 10 clusters. In a hybrid approach proposed in Babu and Murty [1993], the GA is used only to find good initial cluster centers and the k-means algorithm is applied to find the final partition. This hybrid approach performed better than the GA.

A major problem with GAs is their sensitivity to the selection of various parameters such as population size, crossover and mutation probabilities, etc. Grefenstette [1986] has studied this problem and suggested guidelines for selecting these control parameters. However, these guidelines may not yield good results on specific problems like pattern clustering. It was reported in Jones and Beltramo [1991] that hybrid genetic algorithms incorporating problem-specific heuristics are good for clustering. A similar claim is made in Davis [1991] about the applicability of GAs to other practical problems. Another issue with GAs is the selection of an appropriate representation which is low in order and short in defining length.

It is possible to view the clustering problem as an optimization problem that locates the optimal centroids of the clusters directly rather than finding an optimal partition using a GA. This view permits the use of ESs and EP, because centroids can be coded easily in both these approaches, as they support the direct representation of a solution as a real-valued vector. In Babu and Murty [1994], ESs were used on both hard and fuzzy clustering problems, and EP has been used to evolve fuzzy min-max clusters [Fogel and Simpson 1993]. It has been observed that they perform better than their classical counterparts, the k-means algorithm and the fuzzy c-means algorithm. However, all of these approaches suffer (as do GAs and ANNs) from sensitivity to control parameter selection. For each specific problem, one has to tune the parameter values to suit the application.
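The direct real-valued encoding of centroids is easy to illustrate. The sketch below (Python) uses a simple (1+1) evolution strategy with Gaussian mutation, chosen only for illustration; it is not the specific scheme of Babu and Murty [1994] or of the EP work cited above.

    import numpy as np

    def squared_error(centroids, X):
        """Sum over patterns of the squared distance to the closest centroid."""
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return (d.min(axis=1) ** 2).sum()

    def es_clustering(X, K, generations=500, sigma=0.1, seed=0):
        """(1+1) evolution strategy over a real-valued centroid encoding:
        the K x d centroid matrix is the chromosome; Gaussian mutation
        perturbs it, and the child replaces the parent only if its
        squared error is lower."""
        rng = np.random.default_rng(seed)
        parent = X[rng.choice(len(X), K, replace=False)].astype(float)
        best = squared_error(parent, X)
        for _ in range(generations):
            child = parent + sigma * rng.standard_normal(parent.shape)
            err = squared_error(child, X)
            if err < best:
                parent, best = child, err
        return parent, best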

5.9 Search-Based Approaches

Search techniques used to obtain the optimum value of the criterion function are divided into deterministic and stochastic search techniques. Deterministic search techniques guarantee an optimal partition by performing exhaustive enumeration. On the other hand, the stochastic search techniques generate a near-optimal partition reasonably quickly, and guarantee convergence to an optimal partition asymptotically. Among the techniques considered so far, evolutionary approaches are stochastic and the remainder are deterministic. Other deterministic approaches to clustering include the branch-and-bound technique adopted in Koontz et al. [1975] and Cheng [1995] for generating optimal partitions. This approach generates the optimal partition of the data at the cost of excessive computational requirements. In Rose et al. [1993], a deterministic annealing approach was proposed for clustering. This approach employs an annealing technique in which the error surface is smoothed, but convergence to the global optimum is not guaranteed. The use of deterministic annealing in proximity-mode clustering (where the patterns are specified in terms of pairwise proximities rather than multidimensional points) was explored in Hofmann and Buhmann [1997]; later work applied the deterministic annealing approach to texture segmentation [Hofmann and Buhmann 1998].

The deterministic approaches are typically greedy descent approaches, whereas the stochastic approaches permit perturbations to the solutions in non-locally optimal directions also with nonzero probabilities. The stochastic search techniques are either sequential or parallel, while evolutionary approaches are inherently parallel. The simulated annealing approach (SA) [Kirkpatrick et al. 1983] is a sequential stochastic search technique, whose applicability to clustering is discussed in Klein and Dubes [1989]. Simulated annealing procedures are designed to avoid (or recover from) solutions which correspond to local optima of the objective functions. This is accomplished by accepting with some probability a new solution of lower quality (as measured by the criterion function) for the next iteration. The probability of acceptance is governed by a critical parameter called the temperature (by analogy with annealing in metals), which is typically specified in terms of a starting (first iteration) and final temperature value. Selim and Al-Sultan [1991] studied the effects of control parameters on the performance of the algorithm, and Baeza-Yates [1992] used SA to obtain a near-optimal partition of the data. SA is statistically guaranteed to find the globally optimal solution [Aarts and Korst 1989]. A high-level outline of an SA-based algorithm for clustering is given below.

Clustering Based on Simulated Annealing

(1) Randomly select an initial partition P_0 and compute its squared error value, E_P0. Select values for the control parameters, initial and final temperatures T_0 and T_f.

(2) Select a neighbor P_1 of P_0 and compute its squared error value, E_P1. If E_P1 is larger than E_P0, then assign P_1 to P_0 with a temperature-dependent probability; else assign P_1 to P_0. Repeat this step for a fixed number of iterations.

(3) Reduce the value of T_0, i.e., T_0 = cT_0, where c is a predetermined constant. If T_0 is greater than T_f, then go to step 2. Else stop.

The SA algorithm can be slow in reaching the optimal solution, because optimal results require the temperature to be decreased very slowly from iteration to iteration.

Tabu search [Glover 1986], like SA, is a method designed to cross boundaries of feasibility or local optimality and to systematically impose and release constraints to permit exploration of otherwise forbidden regions. Tabu search was used to solve the clustering problem in Al-Sultan [1995].
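The high-level SA outline above translates directly into code. In the sketch below (Python), a neighbor of the current partition is produced by reassigning one randomly chosen pattern to a random cluster, and a worse partition is accepted with probability exp(-(E_P1 - E_P0)/T); these neighborhood and acceptance choices are common conventions assumed for illustration, not requirements of the outline.

    import numpy as np

    def squared_error(labels, X, K):
        return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                   for k in range(K) if np.any(labels == k))

    def sa_clustering(X, K, T0=1.0, Tf=1e-3, c=0.9, iters_per_T=100, seed=0):
        """Clustering based on simulated annealing: single-pattern label
        moves, accepted with a temperature-dependent probability when the
        squared error increases."""
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, K, len(X))          # step 1: random initial partition
        E = squared_error(labels, X, K)
        T = T0
        while T > Tf:                                # step 3: cool until T falls below Tf
            for _ in range(iters_per_T):             # step 2: explore neighbors of P0
                cand = labels.copy()
                cand[rng.integers(len(X))] = rng.integers(K)
                E_cand = squared_error(cand, X, K)
                if E_cand <= E or rng.random() < np.exp(-(E_cand - E) / T):
                    labels, E = cand, E_cand
            T *= c
        return labels, E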

5.10 A Comparison of Techniques

In this section we have examined various deterministic and stochastic search techniques that approach the clustering problem as an optimization problem. A majority of these methods use the squared error criterion function. Hence, the partitions generated by these approaches are not as versatile as those generated by hierarchical algorithms. The clusters generated are typically hyperspherical in shape. Evolutionary approaches are globalized search techniques, whereas the rest of the approaches are localized search techniques. ANNs and GAs are inherently parallel, so they can be implemented using parallel hardware to improve their speed. Evolutionary approaches are population-based; that is, they search using more than one solution at a time, and the rest are based on using a single solution at a time. ANNs, GAs, SA, and tabu search (TS) are all sensitive to the selection of various learning/control parameters. In theory, all four of these methods are weak methods [Rich 1983] in that they do not use explicit domain knowledge. An important feature of the evolutionary approaches is that they can find the optimal solution even when the criterion function is discontinuous.

An empirical study of the performance of the following heuristics for clustering was presented in Mishra and Raghavan [1994]: SA, GA, TS, randomized branch-and-bound (RBA) [Mishra and Raghavan 1994], and hybrid search (HS) strategies [Ismail and Kamel 1989]. The conclusion was that GA performs well in the case of one-dimensional data, while its performance on high dimensional data sets is not impressive. The performance of SA is not attractive because it is very slow. RBA and TS performed best. HS is good for high dimensional data. However, none of the methods was found to be superior to the others by a significant margin. An empirical study of k-means, SA, TS, and GA was presented in Al-Sultan and Khan [1996]. TS, GA and SA were judged comparable in terms of solution quality, and all were better than k-means. However, the k-means method is the most efficient in terms of execution time; other schemes took more time (by a factor of 500 to 2500) to partition a data set of size 60 into 5 clusters. Further, GA encountered the best solution faster than TS and SA; SA took more time than TS to encounter the best solution. However, GA took the maximum time for convergence, that is, to obtain a population of only the best solutions, followed by TS and SA. An important observation is that in both Mishra and Raghavan [1994] and Al-Sultan and Khan [1996] the sizes of the data sets considered are small; that is, fewer than 200 patterns.

A two-layer network was employed in Mao and Jain [1996], with the first layer including a number of principal component analysis subnets, and the second layer using a competitive net. This network performs partitional clustering using the regularized Mahalanobis distance. This net was trained using a set of 1000 randomly selected pixels from a large image and then used to classify every pixel in the image. Babu et al. [1997] proposed a stochastic connectionist approach (SCA) and compared its performance on standard data sets with both the SA and k-means algorithms. It was observed that SCA is superior to both SA and k-means in terms of solution quality. Evolutionary approaches are good only when the data size is less than 1000 and for low dimensional data.

In summary, only the k-means algorithm and its ANN equivalent, the Kohonen net [Mao and Jain 1996], have been applied on large data sets; other approaches have been tested, typically, on small data sets. This is because obtaining suitable learning/control parameters for ANNs, GAs, TS, and SA is difficult and their execution times are very high for large data sets.

However, it has been shown [Selim and Ismail 1984] that the k-means method converges to a locally optimal solution. This behavior is linked with the initial seed selection in the k-means algorithm. So if a good initial partition can be obtained quickly using any of the other techniques, then k-means would work well even on problems with large data sets. Even though the various methods discussed in this section are comparatively weak, it was revealed through experimental studies that combining domain knowledge would improve their performance. For example, ANNs work better in classifying images represented using extracted features than with raw images, and hybrid classifiers work better than ANNs [Mohiuddin and Mao 1994]. Similarly, using domain knowledge to hybridize a GA improves its performance [Jones and Beltramo 1991]. So it may be useful in general to use domain knowledge along with approaches like GA, SA, ANN, and TS. However, these approaches (specifically, the criterion functions used in them) have a tendency to generate a partition of hyperspherical clusters, and this could be a limitation. For example, in cluster-based document retrieval, it was observed that the hierarchical algorithms performed better than the partitional algorithms [Rasmussen 1992].

5.11 Incorporating Domain Constraints in Clustering

As a task, clustering is subjective in nature. The same data set may need to be partitioned differently for different purposes. For example, consider a whale, an elephant, and a tuna fish [Watanabe 1985]. Whales and elephants form a cluster of mammals. However, if the user is interested in partitioning them based on the concept of living in water, then whale and tuna fish are clustered together. Typically, this subjectivity is incorporated into the clustering criterion by incorporating domain knowledge in one or more phases of clustering.

Every clustering algorithm uses some type of knowledge either implicitly or explicitly. Implicit knowledge plays a role in (1) selecting a pattern representation scheme (e.g., using one's prior experience to select and encode features), (2) choosing a similarity measure (e.g., using the Mahalanobis distance instead of the Euclidean distance to obtain hyperellipsoidal clusters), and (3) selecting a grouping scheme (e.g., specifying the k-means algorithm when it is known that clusters are hyperspherical). Domain knowledge is used implicitly in ANNs, GAs, TS, and SA to select the control/learning parameter values that affect the performance of these algorithms.

It is also possible to use explicitly available domain knowledge to constrain or guide the clustering process. Such specialized clustering algorithms have been used in several applications. Domain concepts can play several roles in the clustering process, and a variety of choices are available to the practitioner. At one extreme, the available domain concepts might easily serve as an additional feature (or several), and the remainder of the procedure might be otherwise unaffected. At the other extreme, domain concepts might be used to confirm or veto a decision arrived at independently by a traditional clustering algorithm, or used to affect the computation of distance in a clustering algorithm employing proximity. The incorporation of domain knowledge into clustering consists mainly of ad hoc approaches with little in common; accordingly, our discussion of the idea will consist mainly of motivational material and a brief survey of past work. Machine learning research and pattern recognition research intersect in this topical area, and the interested reader is referred to the prominent journals in machine learning (e.g., Machine Learning, J. of AI Research, or Artificial Intelligence) for a fuller treatment of this topic.

As documented in Cheng and Fu [1985], rules in an expert system may be clustered to reduce the size of the knowledge base. This modification of clustering was also explored in the domains of universities, congressional voting records, and terrorist events by Lebowitz [1987].

5.11.1 Similarity Computation. Conceptual knowledge was used explicitly in the similarity computation phase in Michalski and Stepp [1983]. It was assumed that the pattern representations were available and the dynamic clustering algorithm [Diday 1973] was used to group patterns. The clusters formed were described using conjunctive statements in predicate logic. It was stated in Stepp and Michalski [1986] and Michalski and Stepp [1983] that the groupings obtained by conceptual clustering are superior to those obtained by the numerical methods for clustering. A critical analysis of that work appears in Dale [1985], where it was observed that monothetic divisive clustering algorithms generate clusters that can be described by conjunctive statements. For example, consider Figure 8. Four clusters in this figure, obtained using a monothetic algorithm, can be described by using conjunctive concepts as shown below:

Cluster 1: [X ≤ a] ∧ [Y ≤ b]
Cluster 2: [X ≤ a] ∧ [Y > b]
Cluster 3: [X > a] ∧ [Y > c]
Cluster 4: [X > a] ∧ [Y ≤ c]

where ∧ is the Boolean conjunction ('and') operator, and a, b, and c are constants.

5.11.2 Pattern Representation. It was shown in Srivastava and Murty [1990] that by using knowledge in the pattern representation phase, as is implicitly done in numerical taxonomy approaches, it is possible to obtain the same partitions as those generated by conceptual clustering. In this sense, conceptual clustering and numerical taxonomy are not diametrically opposite, but are equivalent. In the case of conceptual clustering, domain knowledge is explicitly used in interpattern similarity computation, whereas in numerical taxonomy it is implicitly assumed that pattern representations are obtained using the domain knowledge.

5.11.3 Cluster Descriptions. Typically, in knowledge-based clustering, both the clusters and their descriptions or characterizations are generated [Fisher and Langley 1985]. There are some exceptions, for instance Gowda and Diday [1992], where only clustering is performed and no descriptions are generated explicitly. In conceptual clustering, a cluster of objects is described by a conjunctive logical expression [Michalski and Stepp 1983]. Even though a conjunctive statement is one of the most common descriptive forms used by humans, it is a limited form. In Shekar et al. [1987], functional knowledge of objects was used to generate more intuitively appealing cluster descriptions that employ the Boolean implication operator. A system that represents clusters probabilistically was described in Fisher [1987]; these descriptions are more general than conjunctive concepts, and are well-suited to hierarchical classification domains (e.g., the animal species hierarchy). A conceptual clustering system in which clustering is done first is described in Fisher and Langley [1985]. These clusters are then described using probabilities. A similar scheme was described in Murty and Jain [1995], but the descriptions are logical expressions that employ both conjunction and disjunction.
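A conjunctive description of the kind shown above can be read off from per-feature bounds of a cluster. The sketch below (Python) generates axis-aligned interval predicates for a cluster of points; this is only one simple, illustrative way to obtain such rectangular descriptions and is not the specific procedure of any of the systems cited here.

    import numpy as np

    def conjunctive_description(cluster, feature_names):
        """Describe a cluster of points by a conjunction of per-feature
        interval predicates.  The result is a rectangular (axis-aligned)
        description, which reflects the limitation noted in the text."""
        lo, hi = cluster.min(axis=0), cluster.max(axis=0)
        terms = ['[{0} >= {1:g}][{0} <= {2:g}]'.format(name, l, h)
                 for name, l, h in zip(feature_names, lo, hi)]
        return ' ∧ '.join(terms)

    # Example (hypothetical data):
    # points = np.array([[1.0, 0.5], [2.5, 1.0], [3.0, 2.0]])
    # conjunctive_description(points, ['X', 'Y'])
    #   -> '[X >= 1][X <= 3] ∧ [Y >= 0.5][Y <= 2]'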

An important characteristic of conceptual clustering is that it is possible to group objects represented by both qualitative and quantitative features if the clustering leads to a conjunctive concept. For example, the concept cricket ball might be represented as

    color = red ∧ (shape = sphere) ∧ (make = leather) ∧ (radius = 1.4 inches),

where radius is a quantitative feature and the rest are all qualitative features. This description is used to describe a cluster of cricket balls. In Stepp and Michalski [1986], a graph (the goal dependency network) was used to group structured objects. In Shekar et al. [1987] functional knowledge was used to group man-made objects. Functional knowledge was represented using and/or trees [Rich 1983]. For example, the function cooking shown in Figure 22 can be decomposed into functions like holding and heating the material in a liquid medium. Each man-made object has a primary function for which it is produced. Further, based on its features, it may serve additional functions. For example, a book is meant for reading, but if it is heavy then it can also be used as a paper weight. In Sutton et al. [1993], object functions were used to construct generic recognition systems.

Figure 22. Functional knowledge: an and/or tree decomposing 'cooking' into 'heating' and 'liquid holding'.

5.11.4 Pragmatic Issues. Any implementation of a system that explicitly incorporates domain concepts into a clustering technique has to address the following important pragmatic issues:

(1) Representation, availability and completeness of domain concepts.

(2) Construction of inferences using the knowledge.

(3) Accommodation of changing or dynamic knowledge.

In some domains, complete knowledge is available explicitly. For example, the ACM Computing Reviews classification tree used in Murty and Jain [1995] is complete and is explicitly available for use. In several domains, knowledge is incomplete and is not available explicitly. Typically, machine learning techniques are used to automatically extract knowledge, which is a difficult and challenging problem. The most prominently used learning method is "learning from examples" [Quinlan 1990]. This is an inductive learning scheme used to acquire knowledge from examples of each of the classes in different domains. Even if the knowledge is available explicitly, it is difficult to find out whether it is complete and sound. Further, it is extremely difficult to verify soundness and completeness of knowledge extracted from practical data sets, because such knowledge cannot be represented in propositional logic. It is possible that both the data and knowledge keep changing with time. For example, in a library, new books might get added and some old books might be deleted from the collection with time. Also, the classification system (knowledge) employed by the library is updated periodically.

A major problem with knowledge-based clustering is that it has not been applied to large data sets or in domains with large knowledge bases. Typically, the number of objects grouped was less than 1000, and the number of rules used as a part of the knowledge was less than 100. The most difficult problem is to use a very large knowledge base for clustering objects in several practical problems including data mining, image segmentation, and document retrieval.

5.12 Clustering Large Data Sets

There are several applications where it is necessary to cluster a large collection of patterns. The definition of 'large' has varied (and will continue to do so) with changes in technology (e.g., memory and processing time). In the 1960s, 'large' meant several thousand patterns [Ross 1968]; now, there are applications where millions of patterns of high dimensionality have to be clustered. For example, to segment an image of size 500 × 500 pixels, the number of pixels to be clustered is 250,000. In document retrieval and information filtering, millions of patterns with a dimensionality of more than 100 have to be clustered to achieve data abstraction. A majority of the approaches and algorithms proposed in the literature cannot handle such large data sets. Approaches based on genetic algorithms, tabu search and simulated annealing are optimization techniques and are restricted to reasonably small data sets. Implementations of conceptual clustering optimize some criterion functions and are typically computationally expensive.

The convergent k-means algorithm and its ANN equivalent, the Kohonen net, have been used to cluster large data sets [Mao and Jain 1996]. The reasons behind the popularity of the k-means algorithm are:

(1) Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Typically, k and l are fixed in advance and so the algorithm has linear time complexity in the size of the data set [Day 1992].

(2) Its space complexity is O(k + n). It requires additional space to store the data matrix. It is possible to store the data matrix in a secondary memory and access each pattern based on need. However, this scheme requires a huge access time because of the iterative nature of the algorithm, and as a consequence processing time increases enormously.

(3) It is order-independent; for a given initial seed set of cluster centers, it generates the same partition of the data irrespective of the order in which the patterns are presented to the algorithm.

However, the k-means algorithm is sensitive to initial seed selection and, even in the best case, it can produce only hyperspherical clusters.

Hierarchical algorithms are more versatile. But they have the following disadvantages:

(1) The time complexity of hierarchical agglomerative algorithms is O(n^2 log n) [Kurita 1991]. It is possible to obtain single-link clusters using an MST of the data, which can be constructed in O(n log^2 n) time for two-dimensional data [Choudhury and Murty 1990].

(2) The space complexity of agglomerative algorithms is O(n^2). This is because a similarity matrix of size n × n has to be stored. To cluster every pixel in a 100 × 100 image, approximately 200 megabytes of storage would be required (assuming single-precision storage of similarities). It is possible to compute the entries of this matrix based on need instead of storing them (this would increase the algorithm's time complexity [Anderberg 1973]).

Table I lists the time and space complexities of several well-known algorithms. Here, n is the number of patterns to be clustered, k is the number of clusters, and l is the number of iterations.

Table I. Complexity of Clustering Algorithms

    Clustering Algorithm       Time Complexity    Space Complexity
    leader                     O(kn)              O(k)
    k-means                    O(nkl)             O(k)
    ISODATA                    O(nkl)             O(k)
    shortest spanning path     O(n^2)             O(n)
    single-link                O(n^2 log n)       O(n^2)
    complete-link              O(n^2 log n)       O(n^2)

A possible solution to the problem of clustering large data sets while only marginally sacrificing the versatility of clusters is to implement more efficient variants of clustering algorithms. A hybrid approach was used in Ross [1968], where a set of reference points is chosen as in the k-means algorithm, and each of the remaining data points is assigned to one or more reference points or clusters. Minimal spanning trees (MST) are obtained for each group of points separately. These MSTs are merged to form an approximate global MST. This approach computes similarities between only a fraction of all possible pairs of points. It was shown that the number of similarities computed for 10,000 patterns using this approach is the same as the total number of pairs of points in a collection of 2,000 points. Bentley and Friedman [1978] contains an algorithm that can compute an approximate MST in O(n log n) time. A scheme to generate an approximate dendrogram incrementally in O(n log n) time was presented in Zupan [1982], while Venkateswarlu and Raju [1992] proposed an algorithm to speed up the ISODATA clustering algorithm. A study of the approximate single-linkage cluster analysis of large data sets was reported in Eddy et al. [1994]. In that work, an approximate MST was used to form single-link clusters of a data set of size 40,000.

The emerging discipline of data mining (discussed as an application in Section 6) has spurred the development of new algorithms for clustering large data sets. Two algorithms of note are the CLARANS algorithm developed by Ng and Han [1994] and the BIRCH algorithm proposed by Zhang et al. [1996]. CLARANS (Clustering Large Applications based on RANdom Search) identifies candidate cluster centroids through analysis of repeated random samples from the original data. Because of the use of random sampling, the time complexity is O(n) for a pattern set of n elements. The BIRCH algorithm (Balanced Iterative Reducing and Clustering) stores summary information about candidate clusters in a dynamic tree data structure. This tree hierarchically organizes the clusterings represented at the leaf nodes. The tree can be rebuilt when a threshold specifying cluster size is updated manually, or when memory constraints force a change in this threshold. This algorithm, like CLARANS, has a time complexity linear in the number of patterns.

The algorithms discussed above work on large data sets, where it is possible to accommodate the entire pattern set in the main memory. However, there are applications where the entire data set cannot be stored in the main memory because of its size. There are currently three possible approaches to solve this problem; we discuss them in the next three subsections.

(1) The pattern set can be stored in a secondary memory and subsets of this data clustered independently, followed by a merging step to yield a clustering of the entire pattern set. We call this approach the divide and conquer approach.

(2) An incremental clustering algorithm can be employed. Here, the entire data matrix is stored in a secondary memory and data items are transferred to the main memory one at a time for clustering. Only the cluster representations are stored in the main memory to alleviate the space limitations.

(3) A parallel implementation of a clustering algorithm may be used.

5.12.1 Divide and Conquer Approach. Here, we store the entire pattern matrix of size n × d in a secondary storage space (e.g., a disk file).

We divide this data into p blocks, where an optimum value of p can be chosen based on the clustering algorithm used [Murty and Krishna 1980], and each block then contains n/p patterns. We transfer each of these blocks to the main memory and cluster it into k clusters using a standard algorithm. One or more representative samples from each of these clusters are stored separately; we have pk of these representative patterns if we choose one representative per cluster. These pk representatives are further clustered into k clusters and the cluster labels of these representative patterns are used to relabel the original pattern matrix. We depict this two-level algorithm in Figure 23. It is possible to extend this algorithm to any number of levels; more levels are required if the data set is very large and the main memory size is very small [Murty and Krishna 1980]. If the single-link algorithm is used to obtain 5 clusters, then there is a substantial savings in the number of computations, as shown in Table II for optimally chosen p when the number of clusters is fixed at 5. However, this algorithm works well only when the points in each block are reasonably homogeneous, which is often satisfied by image data.

Figure 23. Divide and conquer approach to clustering.

Table II. Number of Distance Computations (n) for the Single-Link Clustering Algorithm and a Two-Level Divide and Conquer Algorithm

    n          Single-link      p     Two-level
    100        4,950            -     1,200
    500        124,750          2     10,750
    1,000      499,500          4     31,500
    10,000     49,995,000       10    1,013,750

A two-level strategy for clustering a data set containing 2,000 patterns was described in Stahl [1986]. In the first level, the data set is loosely clustered into a large number of clusters using the leader algorithm. Representatives from these clusters, one per cluster, are the input to the second level clustering, which is obtained using Ward's hierarchical method.

5.12.2 Incremental Clustering. Incremental clustering is based on the assumption that it is possible to consider patterns one at a time and assign them to existing clusters. Here, a new data item is assigned to a cluster without affecting the existing clusters significantly. A high-level description of a typical incremental clustering algorithm is given below.

An Incremental Clustering Algorithm

(1) Assign the first data item to a cluster.

(2) Consider the next data item. Either assign this item to one of the existing clusters or assign it to a new cluster. This assignment is done based on some criterion, e.g., the distance between the new item and the existing cluster centroids.

(3) Repeat step 2 until all the data items are clustered.

The major advantage of incremental clustering algorithms is that it is not necessary to store the entire pattern matrix in the memory, so their space requirements are very small. Typically, they are noniterative, so their time requirements are also small. There are several incremental clustering algorithms:

(1) The leader clustering algorithm [Hartigan 1975] is the simplest in terms of time complexity, which is O(nk). It has gained popularity because of its neural network implementation, the ART network [Carpenter and Grossberg 1990]. It is very easy to implement as it requires only O(k) space (a sketch of the leader algorithm is given after this list).

(2) The shortest spanning path (SSP) algorithm [Slagle et al. 1975] was originally proposed for data reorganization and was successfully used in automatic auditing of records [Lee et al. 1978]. Here, the SSP algorithm was used to cluster 2000 patterns using 18 features. These clusters are used to estimate missing feature values in data items and to identify erroneous feature values.

(3) The cobweb system [Fisher 1987] is an incremental conceptual clustering algorithm. It has been successfully used in engineering applications [Fisher et al. 1993].

(4) An incremental clustering algorithm for dynamic information processing was presented in Can [1993]. The motivation behind this work is that, in dynamic databases, items might get added and deleted over time. These changes should be reflected in the partition generated without significantly affecting the current clusters. This algorithm was used to incrementally cluster an INSPEC database of 12,684 documents corresponding to computer science and electrical engineering.
algorithm is order-independent if it gen- plications in the future.
erates the same partition for any order
in which the data is presented. Other- 6. APPLICATIONS
wise, it is order-dependent. Most of the
incremental algorithms presented above Clustering algorithms have been used
are order-dependent. We illustrate this in a large variety of applications [Jain
order-dependent property in Figure 24 and Dubes 1988; Rasmussen 1992;
where there are 6 two-dimensional ob- Oehler and Gray 1995; Fisher et al.
jects labeled 1 to 6. If we present these 1993]. In this section, we describe sev-
patterns to the leader algorithm in the eral applications where clustering has
order 2,1,3,5,4,6 then the two clusters been employed as an essential step.
obtained are shown by ellipses. If the These areas are: (1) image segmenta-
order is 1,2,6,4,5,3, then we get a two- tion, (2) object and character recogni-
partition as shown by the triangles. The tion, (3) document retrieval, and (4)
SSP algorithm, cobweb, and the algo- data mining.
rithm in Can [1993] are all order-depen-
dent. 6.1 Image Segmentation Using Clustering
5.12.3 Parallel Implementation. Re- Image segmentation is a fundamental
cent work [Judd et al. 1996] demon- component in many computer vision


Figure 25. Feature representation for clustering. Image measurements and positions are transformed to features. Clusters in feature space correspond to image segments.

Image segmentation is a fundamental component in many computer vision applications, and can be addressed as a clustering problem [Rosenfeld and Kak 1982]. The segmentation of the image(s) presented to an image analysis system is critically dependent on the scene to be sensed, the imaging geometry, configuration, and sensor used to transduce the scene into a digital image, and ultimately the desired output (goal) of the system.

The applicability of clustering methodology to the image segmentation problem was recognized over three decades ago, and the paradigms underlying the initial pioneering efforts are still in use today. A recurring theme is to define feature vectors at every image location (pixel) composed of both functions of image intensity and functions of the pixel location itself. This basic idea, depicted in Figure 25, has been successfully used for intensity images (with or without texture), range (depth) images and multispectral images.

6.1.1 Segmentation. An image segmentation is typically defined as an exhaustive partitioning of an input image into regions, each of which is considered to be homogeneous with respect to some image property of interest (e.g., intensity, color, or texture) [Jain et al. 1995]. If

    X = {x_ij, i = 1 ... N_r, j = 1 ... N_c}

is the input image with N_r rows and N_c columns and measurement value x_ij at pixel (i, j), then the segmentation can be expressed as S = {S_1, ..., S_k}, with the lth segment

    S_l = {(i_l1, j_l1), ..., (i_lN_l, j_lN_l)}

consisting of a connected subset of the pixel coordinates. No two segments share any pixel locations (S_i ∩ S_j = ∅ for all i ≠ j), and the union of all segments covers the entire image (∪_{i=1}^{k} S_i = {1 ... N_r} × {1 ... N_c}). Jain and Dubes [1988], after Fu and Mui [1981], identified three techniques for producing segmentations from input imagery: region-based, edge-based, or cluster-based.

Consider the use of simple gray level thresholding to segment a high-contrast intensity image. Figure 26(a) shows a grayscale image of a textbook's bar code scanned on a flatbed scanner. Part b shows the results of a simple thresholding operation designed to separate the dark and light regions in the bar code area. Binarization steps like this are often performed in character recognition systems.

Figure 26. Binarization via thresholding. (a) Original grayscale image. (b) Gray-level histogram. (c) Results of thresholding.

Thresholding in effect 'clusters' the image pixels into two groups based on the one-dimensional intensity measurement [Rosenfeld 1969; Dunn et al. 1974]. A postprocessing step separates the classes into connected regions. While simple gray level thresholding is adequate in some carefully controlled image acquisition environments, and much research has been devoted to appropriate methods for thresholding [Weszka 1978; Trier and Jain 1995], complex images require more elaborate segmentation techniques.

Many segmenters use measurements which are both spectral (e.g., the multispectral scanner used in remote sensing) and spatial (based on the pixel's location in the image plane). The measurement at each pixel hence corresponds directly to our concept of a pattern.

6.1.2 Image Segmentation Via Clustering. The application of local feature clustering to segment gray-scale images was documented in Schachter et al. [1979]. This paper emphasized the appropriate selection of features at each pixel rather than the clustering methodology, and proposed the use of image plane coordinates (spatial information) as additional features to be employed in clustering-based segmentation. The goal of clustering was to obtain a sequence of hyperellipsoidal clusters starting with cluster centers positioned at maximum density locations in the pattern space, and growing clusters about these centers until a chi-square test for goodness of fit was violated. A variety of features were discussed and applied to both grayscale and color imagery.


discussed and applied to both grayscale and color imagery.

An agglomerative clustering algorithm was applied in Silverman and Cooper [1988] to the problem of unsupervised learning of clusters of coefficient vectors for two image models that correspond to image segments. The first image model is polynomial for the observed image measurements; the assumption here is that the image is a collection of several adjoining graph surfaces, each a polynomial function of the image plane coordinates, which are sampled on the raster grid to produce the observed image. The algorithm proceeds by obtaining vectors of coefficients of least-squares fits to the data in M disjoint image windows. An agglomerative clustering algorithm merges (at each step) the two clusters that have a minimum global between-cluster Mahalanobis distance. The same framework was applied to segmentation of textured images, but for such images the polynomial model was inappropriate, and a parameterized Markov Random Field model was assumed instead.

Wu and Leahy [1993] describe the application of the principles of network flow to unsupervised classification, yielding a novel hierarchical algorithm for clustering. In essence, the technique views the unlabeled patterns as nodes in a graph, where the weight of an edge (i.e., its capacity) is a measure of similarity between the corresponding nodes. Clusters are identified by removing edges from the graph to produce connected disjoint subgraphs. In image segmentation, pixels which are 4-neighbors or 8-neighbors in the image plane share edges in the constructed adjacency graph, and the weight of a graph edge is based on the strength of a hypothesized image edge between the pixels involved (this strength is calculated using simple derivative masks). Hence, this segmenter works by finding closed contours in the image, and is best labeled edge-based rather than region-based.

In Vinod et al. [1994], two neural networks are designed to perform pattern clustering when combined. A two-layer network operates on a multidimensional histogram of the data to identify 'prototypes' which are used to classify the input patterns into clusters. These prototypes are fed to the classification network, another two-layer network operating on the histogram of the input data, but are trained to have differing weights from the prototype selection network. In both networks, the histogram of the image is used to weight the contributions of patterns neighboring the one under consideration to the location of prototypes or the ultimate classification; as such, it is likely to be more robust when compared to techniques which assume an underlying parametric density function for the pattern classes. This architecture was tested on gray-scale and color segmentation problems.

Jolion et al. [1991] describe a process for extracting clusters sequentially from the input pattern set by identifying hyperellipsoidal regions (bounded by loci of constant Mahalanobis distance) which contain a specified fraction of the unclassified points in the set. The extracted regions are compared against the best-fitting multivariate Gaussian density through a Kolmogorov-Smirnov test, and the fit quality is used as a figure of merit for selecting the 'best' region at each iteration. The process continues until a stopping criterion is satisfied. This procedure was applied to the problems of threshold selection for multithreshold segmentation of intensity imagery and segmentation of range imagery.

Clustering techniques have also been successfully used for the segmentation of range images, which are a popular source of input data for three-dimensional object recognition systems [Jain and Flynn 1993]. Range sensors typically return raster images with the measured value at each pixel being the coordinates of a 3D location in space. These 3D positions can be understood


as the locations where rays emerging from the image plane locations in a bundle intersect the objects in front of the sensor.

The local feature clustering concept is particularly attractive for range image segmentation since (unlike intensity measurements) the measurements at each pixel have the same units (length); this would make ad hoc transformations or normalizations of the image features unnecessary if their goal is to impose equal scaling on those features. However, range image segmenters often add additional measurements to the feature space, removing this advantage.

A range image segmentation system described in Hoffman and Jain [1987] employs squared error clustering in a six-dimensional feature space as a source of an "initial" segmentation which is refined (typically by merging segments) into the output segmentation. The technique was enhanced in Flynn and Jain [1991] and used in a recent systematic comparison of range image segmenters [Hoover et al. 1996]; as such, it is probably one of the longest-lived range segmenters which has performed well on a large variety of range images.

This segmenter works as follows. At each pixel (i, j) in the input range image, the corresponding 3D measurement is denoted (x_ij, y_ij, z_ij), where typically x_ij is a linear function of j (the column number) and y_ij is a linear function of i (the row number). A k × k neighborhood of (i, j) is used to estimate the 3D surface normal n_ij = (n^x_ij, n^y_ij, n^z_ij) at (i, j), typically by finding the least-squares planar fit to the 3D points in the neighborhood. The feature vector for the pixel at (i, j) is the six-dimensional measurement (x_ij, y_ij, z_ij, n^x_ij, n^y_ij, n^z_ij), and a candidate segmentation is found by clustering these feature vectors. For practical reasons, not every pixel's feature vector is used in the clustering procedure; typically 1000 feature vectors are chosen by subsampling.

The CLUSTER algorithm [Jain and Dubes 1988] was used to obtain segment labels for each pixel. CLUSTER is an enhancement of the k-means algorithm; it has the ability to identify several clusterings of a data set, each with a different number of clusters. Hoffman and Jain [1987] also experimented with other clustering techniques (e.g., complete-link, single-link, graph-theoretic, and other squared error algorithms) and found CLUSTER to provide the best combination of performance and accuracy. An additional advantage of CLUSTER is that it produces a sequence of output clusterings (i.e., a 2-cluster solution up through a K_max-cluster solution where K_max is specified by the user and is typically 20 or so); each clustering in this sequence yields a clustering statistic which combines between-cluster separation and within-cluster scatter. The clustering that optimizes this statistic is chosen as the best one. Each pixel in the range image is assigned the segment label of the nearest cluster center. This minimum distance classification step is not guaranteed to produce segments which are connected in the image plane; therefore, a connected components labeling algorithm allocates new labels for disjoint regions that were placed in the same cluster. Subsequent operations include surface type tests, merging of adjacent patches using a test for the presence of crease or jump edges between adjacent segments, and surface parameter estimation.

Figure 27 shows this processing applied to a range image. Part a of the figure shows the input range image; part b shows the distribution of surface normals. In part c, the initial segmentation returned by CLUSTER and modified to guarantee connected segments is shown. Part d shows the final segmentation produced by merging adjacent patches which do not have a significant crease edge between them. The final clusters reasonably represent distinct surfaces present in this complex object.
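A minimal sketch of the per-pixel feature construction just described: surface normals from a least-squares planar fit over a k × k neighborhood, a subsample of roughly 1000 six-dimensional feature vectors, and a plain k-means loop standing in for the CLUSTER program. This is not the authors' implementation; the synthetic range image and all parameter values are assumptions made for illustration.

import numpy as np

def plane_normal(points):
    """Unit normal of the least-squares plane through an (n, 3) point set:
    the right singular vector with the smallest singular value."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    n = vt[-1]
    if n[2] < 0:          # fix the sign ambiguity of the fitted normal
        n = -n
    return n / np.linalg.norm(n)

def range_features(xyz, k=5):
    """Six-dimensional feature (x, y, z, nx, ny, nz) per interior pixel of an
    (H, W, 3) range image, with normals from a k x k least-squares plane fit."""
    H, W, _ = xyz.shape
    r = k // 2
    feats = []
    for i in range(r, H - r):
        for j in range(r, W - r):
            nbhd = xyz[i - r:i + r + 1, j - r:j + r + 1].reshape(-1, 3)
            feats.append(np.concatenate([xyz[i, j], plane_normal(nbhd)]))
    return np.array(feats)

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd iterations; a simple stand-in for the CLUSTER program."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == c].mean(0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return centers, labels

if __name__ == "__main__":
    # Synthetic range image: two planar surfaces meeting at a crease.
    H, W = 40, 40
    jj, ii = np.meshgrid(np.arange(W), np.arange(H))
    z = np.where(jj < W // 2, 0.05 * jj, 0.05 * (W - jj))
    xyz = np.dstack([jj * 1.0, ii * 1.0, z])
    X = range_features(xyz)
    sample = X[np.random.default_rng(1).choice(len(X), 1000, replace=False)]
    centers, _ = kmeans(sample, k=2)
    print("cluster centers (x, y, z, nx, ny, nz):\n", np.round(centers, 2))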

Figure 27. Range image segmentation using clustering. (a): Input range image. (b): Surface normals
for selected image pixels. (c): Initial segmentation (19 cluster solution) returned by CLUSTER using
1000 six-dimensional samples from the image as a pattern set. (d): Final segmentation (8 segments)
produced by postprocessing.

The analysis of textured images has been of interest to researchers for several years. Texture segmentation techniques have been developed using a variety of texture models and image operations. In Nguyen and Cohen [1993], texture image segmentation was addressed by modeling the image as a hierarchy of two Markov Random Fields, obtaining some simple statistics from each image block to form a feature vector, and clustering these blocks using a fuzzy K-means clustering method. The clustering procedure here is modified to jointly estimate the number of clusters as well as the fuzzy membership of each feature vector to the various clusters.

A system for segmenting texture images was described in Jain and Farrokhnia [1991]; there, Gabor filters were used to obtain a set of 28 orientation- and scale-selective features that characterize the texture in the neighborhood of each pixel. These 28 features are reduced to a smaller number through a feature selection procedure, and the resulting features are preprocessed and then clustered using the CLUSTER program.

Figure 28. Texture image segmentation results. (a): Four-class texture mosaic. (b): Four-cluster
solution produced by CLUSTER with pixel coordinates included in the feature set.
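The sketch below illustrates the general recipe of clustering per-pixel texture features augmented with pixel coordinates, as in Figure 28(b). It is only loosely modeled on the cited system: local mean and variance at two window sizes substitute for the 28 Gabor features, the coordinate weighting is arbitrary, and the whole construction is an assumption for illustration rather than the published method.

import numpy as np
from scipy.ndimage import uniform_filter

def texture_features(image, sizes=(5, 11), coord_weight=1.0):
    """Per-pixel texture features: local mean and variance at a few window
    sizes (a crude stand-in for an orientation/scale-selective filter bank),
    plus the pixel coordinates as spatial information."""
    image = np.asarray(image, dtype=float)
    H, W = image.shape
    feats = []
    for s in sizes:
        mean = uniform_filter(image, size=s)
        var = uniform_filter(image ** 2, size=s) - mean ** 2
        feats += [mean, var]
    ii, jj = np.mgrid[0:H, 0:W].astype(float)
    feats += [coord_weight * ii, coord_weight * jj]
    X = np.dstack(feats).reshape(-1, len(feats))
    return (X - X.mean(0)) / (X.std(0) + 1e-9)   # simple preprocessing/normalization

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two-texture mosaic: smooth noise on the left, coarse noise on the right.
    img = np.hstack([rng.normal(0, 1, (64, 32)), rng.normal(0, 6, (64, 32))])
    X = texture_features(img)
    # Any clustering routine (e.g., the k-means loop sketched earlier) can now
    # be applied to the rows of X, one row per pixel, to obtain a segmentation.
    print(X.shape)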

An index statistic [Dubes 1987] is used to select the best clustering. Minimum distance classification is used to label each of the original image pixels. This technique was tested on several texture mosaics including the natural Brodatz textures and synthetic images. Figure 28(a) shows an input texture mosaic consisting of four of the popular Brodatz textures [Brodatz 1966]. Part b shows the segmentation produced when the Gabor filter features are augmented to contain spatial information (pixel coordinates). This Gabor filter based technique has proven very powerful and has been extended to the automatic segmentation of text in documents [Jain and Bhattacharjee 1992] and segmentation of objects in complex backgrounds [Jain et al. 1997].

Clustering can be used as a preprocessing stage to identify pattern classes for subsequent supervised classification. Taxt and Lundervold [1994] and Lundervold et al. [1996] describe a partitional clustering algorithm and a manual labeling technique to identify material classes (e.g., cerebrospinal fluid, white matter, striated muscle, tumor) in registered images of a human head obtained at five different magnetic resonance imaging channels (yielding a five-dimensional feature vector at each pixel). A number of clusterings were obtained and combined with domain knowledge (human expertise) to identify the different classes. Decision rules for supervised classification were based on these obtained classes. Figure 29(a) shows one channel of an input multispectral image; part b shows the 9-cluster result.

The k-means algorithm was applied to the segmentation of LANDSAT imagery in Solberg et al. [1996]. Initial cluster centers were chosen interactively by a trained operator, and correspond to land-use classes such as urban areas, soil (vegetation-free) areas, forest, grassland, and water. Figure 30(a) shows the input image rendered as grayscale; part b shows the result of the clustering procedure.

Figure 29. Multispectral medical image segmentation. (a): A single channel of the input image. (b):
9-cluster segmentation.

Figure 30. LANDSAT image segmentation. (a): Original image (ESA/EURIMAGE/Sattelitbild). (b):
Clustered scene.

6.1.3 Summary. In this section, the application of clustering methodology to image segmentation problems has been motivated and surveyed. The historical record shows that clustering is a powerful tool for obtaining classifications of image pixels. Key issues in the design of any clustering-based segmenter are the choice of pixel measurements (features) and dimensionality of the feature vector (i.e., should the feature vector contain intensities, pixel positions, model parameters, filter outputs?), a measure of similarity which is appropriate for the selected features and the application domain, the identification of a clustering algorithm, the development of strategies for feature and data reduction (to avoid the "curse of dimensionality" and the computational burden of classifying large numbers of patterns and/or features), and the identification of necessary pre- and post-processing techniques (e.g., image smoothing and minimum distance classification). The use of clustering for segmentation dates back to the 1960s, and new variations continue to emerge in the literature. Challenges to the more successful use of clustering include the high computational complexity of many clustering algorithms and their incorporation of


strong assumptions (often multivariate Gaussian) about the multidimensional shape of clusters to be obtained. The ability of new clustering procedures to handle concepts and semantics in classification (in addition to numerical measurements) will be important for certain applications [Michalski and Stepp 1983; Murty and Jain 1995].

6.2 Object and Character Recognition

6.2.1 Object Recognition. The use of clustering to group views of 3D objects for the purposes of object recognition in range data was described in Dorai and Jain [1995]. The term view refers to a range image of an unoccluded object obtained from any arbitrary viewpoint. The system under consideration employed a viewpoint dependent (or view-centered) approach to the object recognition problem; each object to be recognized was represented in terms of a library of range images of that object. There are many possible views of a 3D object and one goal of that work was to avoid matching an unknown input view against each image of each object. A common theme in the object recognition literature is indexing, wherein the unknown view is used to select a subset of views of a subset of the objects in the database for further comparison, and rejects all other views of objects. One of the approaches to indexing employs the notion of view classes; a view class is the set of qualitatively similar views of an object. In that work, the view classes were identified by clustering; the rest of this subsection outlines the technique.

Object views were grouped into classes based on the similarity of shape spectral features. Each input image of an object viewed in isolation yields a feature vector which characterizes that view. The feature vector contains the first ten central moments of a normalized shape spectral distribution, H̄(h), of an object view. The shape spectrum of an object view is obtained from its range data by constructing a histogram of shape index values (which are related to surface curvature values) and accumulating all the object pixels that fall into each bin. By normalizing the spectrum with respect to the total object area, the scale (size) differences that may exist between different objects are removed. The first moment m_1 is computed as the weighted mean of H̄(h):

m_1 = Σ_h h H̄(h).   (1)

The other central moments, m_p, 2 ≤ p ≤ 10, are defined as:

m_p = Σ_h (h − m_1)^p H̄(h).   (2)

Then, the feature vector is denoted as R = (m_1, m_2, ···, m_10), with the range of each of these moments being [−1, 1].

Let 𝒪 = {O_1, O_2, ···, O_n} be a collection of n 3D objects whose views are present in the model database, ℳ_D. The ith view of the jth object, O_ij, in the database is represented by ⟨L_ij, R_ij⟩, where L_ij is the object label and R_ij is the feature vector. Given a set of object representations ℛ^i = {⟨L_i1, R_i1⟩, ···, ⟨L_im, R_im⟩} that describes m views of the ith object, the goal is to derive a partition of the views, 𝒫^i = {C^i_1, C^i_2, ···, C^i_{k_i}}. Each cluster in 𝒫^i contains those views of the ith object that have been adjudged similar based on the dissimilarity between the corresponding moment features of the shape spectra of the views. The measure of dissimilarity between R_ij and R_ik is defined as:

𝒟(R_ij, R_ik) = Σ_{l=1}^{10} (R_ijl − R_ikl)².   (3)
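The moment features and the dissimilarity of Eqs. (1)-(3) are straightforward to compute once a shape spectrum is available. The sketch below is only an illustration: the bin centers, the toy "spectra" built from random samples, and the normalization step are assumptions, since the cited work derives the spectrum from shape index values of real range data.

import numpy as np

def moment_features(shape_spectrum):
    """First ten central moments m_1..m_10 of a normalized shape spectrum H(h),
    following Eqs. (1) and (2): m_1 is the weighted mean of the bin values h,
    and m_p (2 <= p <= 10) are central moments about m_1."""
    H = np.asarray(shape_spectrum, dtype=float)
    H = H / H.sum()                       # normalize so the bins sum to one
    h = np.linspace(0.0, 1.0, len(H))     # assumed bin centers for the shape index
    m1 = np.sum(h * H)                                            # Eq. (1)
    R = [m1] + [np.sum((h - m1) ** p * H) for p in range(2, 11)]  # Eq. (2)
    return np.array(R)

def dissimilarity(Ri, Rk):
    """Squared Euclidean distance between two moment vectors, Eq. (3)."""
    Ri, Rk = np.asarray(Ri), np.asarray(Rk)
    return float(np.sum((Ri - Rk) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spectrum_a = np.histogram(rng.normal(0.3, 0.05, 5000), bins=64, range=(0, 1))[0]
    spectrum_b = np.histogram(rng.normal(0.7, 0.10, 5000), bins=64, range=(0, 1))[0]
    Ra, Rb = moment_features(spectrum_a), moment_features(spectrum_b)
    print("D(Ra, Rb) =", round(dissimilarity(Ra, Rb), 4))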
6.2.2 Clustering Views. A database containing 3,200 range images of 10 different sculpted objects with 320 views per object is used [Dorai and Jain 1995].
Figure 31. A subset of views of Cobra chosen from a set of 320 views.

The range images from 320 possible viewpoints (determined by the tessellation of the view-sphere using the icosahedron) of the objects were synthesized. Figure 31 shows a subset of the collection of views of Cobra used in the experiment.

The shape spectrum of each view is computed and then its feature vector is determined. The views of each object are clustered, based on the dissimilarity measure 𝒟 between their moment vectors, using the complete-link hierarchical clustering scheme [Jain and Dubes 1988]. The hierarchical grouping obtained with 320 views of the Cobra object is shown in Figure 32. The view grouping hierarchies of the other nine objects are similar to the dendrogram in Figure 32. This dendrogram is cut at a dissimilarity level of 0.1 or less to obtain compact and well-separated clusters. The clusterings obtained in this manner demonstrate that the views of each object fall into several distinguishable clusters. The centroid of each of these clusters was determined by computing the mean of the moment vectors of the views falling into the cluster. Dorai and Jain [1995] demonstrated that this clustering-based view grouping procedure facilitates object matching

Figure 32. Hierarchical grouping of 320 views of a cobra sculpture.
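A small, self-contained sketch of the complete-link grouping and the threshold cut described above. The actual experiments used the implementation accompanying Jain and Dubes [1988]; here a naive agglomeration loop, toy "moment vectors," and squared Euclidean dissimilarities (as in Eq. (3)) are used purely for illustration, with the 0.1 cut value taken from the text.

import numpy as np

def complete_link_clusters(D, threshold):
    """Naive complete-link agglomeration: repeatedly merge the two clusters
    whose maximum pairwise dissimilarity (the complete link) is smallest,
    stopping when no merge below the threshold remains. D is a symmetric
    dissimilarity matrix; returns a list of clusters (lists of indices)."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best:
                    best, pair = link, (a, b)
        if best is None or best > threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy moment vectors: two tight groups of views in 10 dimensions.
    views = np.vstack([rng.normal(0.0, 0.02, (5, 10)), rng.normal(0.5, 0.02, (5, 10))])
    D = ((views[:, None] - views[None, :]) ** 2).sum(-1)   # squared Euclidean, as in Eq. (3)
    groups = complete_link_clusters(D, threshold=0.1)
    print("view groups:", groups)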

in terms of classification accuracy and the number of matches necessary for correct classification of test views. Object views are grouped into compact and homogeneous view clusters, thus demonstrating the power of the cluster-based scheme for view organization and efficient object matching.

6.2.3 Character Recognition. Clustering was employed in Connell and Jain [1998] to identify lexemes in handwritten text for the purposes of writer-independent handwriting recognition. The success of a handwriting recognition system is vitally dependent on its acceptance by potential users. Writer-dependent systems provide a higher level of recognition accuracy than writer-independent systems, but require a large amount of training data. A writer-independent system, on the other hand, must be able to recognize a wide variety of writing styles in order to satisfy an individual user. As the variability of the writing styles that must be captured by a system increases, it becomes more and more difficult to discriminate between different classes due to the amount of overlap in the feature space. One solution to this problem is to separate the data from these disparate writing styles for each class into different subclasses, known as lexemes. These lexemes represent portions of the data which are more easily separated from the data of classes other than that to which the lexeme belongs.

In this system, handwriting is captured by digitizing the (x, y) position of the pen and the state of the pen point


(up or down) at a constant sampling rate. Following some resampling, normalization, and smoothing, each stroke of the pen is represented as a variable-length string of points. A metric based on elastic template matching and dynamic programming is defined to allow the distance between two strokes to be calculated.

Using the distances calculated in this manner, a proximity matrix is constructed for each class of digits (i.e., 0 through 9). Each matrix measures the intraclass distances for a particular digit class. Digits in a particular class are clustered in an attempt to find a small number of prototypes. Clustering is done using the CLUSTER program described above [Jain and Dubes 1988], in which the feature vector for a digit is its N proximities to the digits of the same class. CLUSTER attempts to produce the best clustering for each value of K over some range, where K is the number of clusters into which the data is to be partitioned. As expected, the mean squared error (MSE) decreases monotonically as a function of K. The "optimal" value of K is chosen by identifying a "knee" in the plot of MSE vs. K.

When representing a cluster of digits by a single prototype, the best on-line recognition results were obtained by using the digit that is closest to that cluster's center. Using this scheme, a correct recognition rate of 99.33% was obtained.
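The two decisions described above (choosing K at the knee of the MSE curve, and representing each cluster by its member closest to the center) can be sketched as follows. The knee-detection rule shown (maximum distance from the chord joining the curve's endpoints) is one common heuristic, not necessarily the one used by Connell and Jain; the MSE values and stand-in feature vectors are hypothetical.

import numpy as np

def knee_of_curve(mse_values):
    """Pick the 'knee' of a decreasing MSE-vs-K curve as the point of maximum
    distance from the straight line joining its endpoints."""
    y = np.asarray(mse_values, dtype=float)
    x = np.arange(len(y), dtype=float)
    p0, p1 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    d = (p1 - p0) / np.linalg.norm(p1 - p0)
    pts = np.c_[x, y]
    proj = np.outer((pts - p0) @ d, d) + p0          # projections onto the chord
    dist = np.linalg.norm(pts - proj, axis=1)        # perpendicular distances
    return int(np.argmax(dist))

def prototype_index(cluster_points):
    """Index of the member closest to the cluster mean; used as the single
    prototype that represents a cluster of digits."""
    X = np.asarray(cluster_points, dtype=float)
    center = X.mean(axis=0)
    return int(np.argmin(((X - center) ** 2).sum(axis=1)))

if __name__ == "__main__":
    # Hypothetical MSE values for K = 1..8: a sharp drop followed by a plateau.
    mse = [100.0, 40.0, 18.0, 15.0, 13.5, 12.8, 12.3, 12.0]
    print("chosen K:", knee_of_curve(mse) + 1)
    rng = np.random.default_rng(0)
    digits = rng.normal(0, 1, (20, 16))   # stand-in feature vectors for one digit class
    print("prototype index:", prototype_index(digits))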
6.3 Information Retrieval

Information retrieval (IR) is concerned with automatic storage and retrieval of documents [Rasmussen 1992]. Many university libraries use IR systems to provide access to books, journals, and other documents. Libraries use the Library of Congress Classification (LCC) scheme for efficient storage and retrieval of books. The LCC scheme consists of classes labeled A to Z [LC Classification Outline 1990] which are used to characterize books belonging to different subjects. For example, label Q corresponds to books in the area of science, and the subclass QA is assigned to mathematics. Labels QA76 to QA76.8 are used for classifying books related to computers and other areas of computer science.

There are several problems associated with the classification of books using the LCC scheme. Some of these are listed below:

(1) When a user is searching for books in a library which deal with a topic of interest to him, the LCC number alone may not be able to retrieve all the relevant books. This is because the classification number assigned to the books or the subject categories that are typically entered in the database do not contain sufficient information regarding all the topics covered in a book. To illustrate this point, let us consider the book Algorithms for Clustering Data by Jain and Dubes [1988]. Its LCC number is 'QA 278.J35'. In this LCC number, QA 278 corresponds to the topic 'cluster analysis', J corresponds to the first author's name and 35 is the serial number assigned by the Library of Congress. The subject categories for this book provided by the publisher (which are typically entered in a database to facilitate search) are cluster analysis, data processing and algorithms. There is a chapter in this book [Jain and Dubes 1988] that deals with computer vision, image processing, and image segmentation. So a user looking for literature on computer vision and, in particular, image segmentation will not be able to access this book by searching the database with the help of either the LCC number or the subject categories provided in the database. The LCC number for computer vision books is TA 1632 [LC Classification 1990] which is very different from the number QA 278.J35 assigned to this book.


(2) There is an inherent problem in assigning LCC numbers to books in a rapidly developing area. For example, let us consider the area of neural networks. Initially, category 'QP' in the LCC scheme was used to label books and conference proceedings in this area. For example, Proceedings of the International Joint Conference on Neural Networks [IJCNN'91] was assigned the number 'QP 363.3'. But most of the recent books on neural networks are given a number using the category label 'QA'; Proceedings of the IJCNN'92 [IJCNN'92] is assigned the number 'QA 76.87'. Multiple labels for books dealing with the same topic will force them to be placed on different stacks in a library. Hence, there is a need to update the classification labels from time to time in an emerging discipline.

(3) Assigning a number to a new book is a difficult problem. A book may deal with topics corresponding to two or more LCC numbers, and therefore, assigning a unique number to such a book is difficult.

Murty and Jain [1995] describe a knowledge-based clustering scheme to group representations of books, which are obtained using the ACM CR (Association for Computing Machinery Computing Reviews) classification tree [ACM CR Classifications 1994]. This tree is used by the authors contributing to various ACM publications to provide keywords in the form of ACM CR category labels. This tree consists of 11 nodes at the first level. These nodes are labeled A to K. Each node in this tree has a label that is a string of one or more symbols. These symbols are alphanumeric characters. For example, I515 is the label of a fourth-level node in the tree.

6.3.1 Pattern Representation. Each book is represented as a generalized list [Sangal 1991] of these strings using the ACM CR classification tree. For the sake of brevity in representation, the fourth-level nodes in the ACM CR classification tree are labeled using numerals 1 to 9 and characters A to Z. For example, the children nodes of I.5.1 (models) are labeled I.5.1.1 to I.5.1.6. Here, I.5.1.1 corresponds to the node labeled deterministic, and I.5.1.6 stands for the node labeled structural. In a similar fashion, all the fourth-level nodes in the tree can be labeled as necessary. From now on, the dots in between successive symbols will be omitted to simplify the representation. For example, I.5.1.1 will be denoted as I511.

We illustrate this process of representation with the help of the book by Jain and Dubes [1988]. There are five chapters in this book. For simplicity of processing, we consider only the information in the chapter contents. There is a single entry in the table of contents for chapter 1, 'Introduction,' and so we do not extract any keywords from this. Chapter 2, labeled 'Data Representation,' has section titles that correspond to the labels of the nodes in the ACM CR classification tree [ACM CR Classifications 1994] which are given below:

(a) I522 (feature evaluation and selection),

(b) I532 (similarity measures), and

(c) I515 (statistical).

Based on the above analysis, Chapter 2 of Jain and Dubes [1988] can be characterized by the weighted disjunction ((I522 ∨ I532 ∨ I515)(1,4)). The weights (1,4) denote that it is one of the four chapters which plays a role in the representation of the book. Based on the table of contents, we can use one or more of the strings I522, I532, and I515 to represent Chapter 2. In a similar manner, we can represent other chapters in this book as weighted disjunctions based on the table of contents and the ACM CR classification tree. The representation of the entire book, the conjunction of all these chapter representations, is given by (((I522 ∨ I532 ∨ I515)(1,4)) ∧ ((I515 ∨ I531)(2,4)) ∧ ((I541 ∨ I46 ∨ I434)(1,4))).


Currently, these representations are generated manually by scanning the tables of contents of books in the computer science area, as the ACM CR classification tree provides knowledge of computer science books only. The details of the collection of books used in this study are available in Murty and Jain [1995].

6.3.2 Similarity Measure. The similarity between two books is based on the similarity between the corresponding strings. Two of the well-known distance functions between a pair of strings are [Baeza-Yates 1992] the Hamming distance and the edit distance. Neither of these two distance functions can be meaningfully used in this application. The following example illustrates the point. Consider three strings I242, I233, and H242. These strings are labels (predicate logic for knowledge representation, logic programming, and distributed database systems) of three fourth-level nodes in the ACM CR classification tree. Nodes I242 and I233 are the grandchildren of the node labeled I2 (artificial intelligence) and H242 is a grandchild of the node labeled H2 (database management). So, the distance between I242 and I233 should be smaller than that between I242 and H242. However, Hamming distance and edit distance [Baeza-Yates 1992] both have a value of 2 between I242 and I233 and a value of 1 between I242 and H242. This limitation motivates the definition of a new similarity measure that correctly captures the similarity between the above strings. The similarity between two strings is defined as the ratio of the length of the largest common prefix [Murty and Jain 1995] between the two strings to the length of the first string. For example, the similarity between strings I522 and I51 is 0.5. The proposed similarity measure is not symmetric because the similarity between I51 and I522 is 0.67. The minimum and maximum values of this similarity measure are 0.0 and 1.0, respectively. The knowledge of the relationship between nodes in the ACM CR classification tree is captured by the representation in the form of strings. For example, the node labeled pattern recognition is represented by the string I5, whereas the string I53 corresponds to the node labeled clustering. The similarity between these two nodes (I5 and I53) is 1.0. A symmetric measure of similarity [Murty and Jain 1995] is used to construct a similarity matrix of size 100 × 100 corresponding to the 100 books used in experiments.
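The prefix-based measure just defined is simple to state in code. The short sketch below is not from the cited work; it merely reproduces the example values quoted in the text (0.5, 0.67, and 1.0).

def prefix_similarity(s1, s2):
    """Similarity between two ACM CR label strings: length of the largest
    common prefix divided by the length of the first string (asymmetric)."""
    common = 0
    for a, b in zip(s1, s2):
        if a != b:
            break
        common += 1
    return common / len(s1)

if __name__ == "__main__":
    print(prefix_similarity("I522", "I51"))   # 0.5, as in the text
    print(prefix_similarity("I51", "I522"))   # ~0.67: the measure is not symmetric
    print(prefix_similarity("I5", "I53"))     # 1.0: 'pattern recognition' vs. 'clustering'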
6.3.3 An Algorithm for Clustering Books. The clustering problem can be stated as follows. Given a collection ℬ of books, we need to obtain a set 𝒞 of clusters. A proximity dendrogram [Jain and Dubes 1988], using the complete-link agglomerative clustering algorithm for the collection of 100 books, is shown in Figure 33. Seven clusters are obtained by choosing a threshold (t) value of 0.12. It is well known that different values for t might give different clusterings. This threshold value is chosen because the "gap" in the dendrogram between the levels at which six and seven clusters are formed is the largest. An examination of the subject areas of the books [Murty and Jain 1995] in these clusters revealed that the clusters obtained are indeed meaningful. Each of these clusters is represented using a list of string s and frequency s_f pairs, where s_f is the number of books in the cluster in which s is present. For example, cluster c1 contains 43 books belonging to pattern recognition, neural networks, artificial intelligence, and computer vision; a part of its representation ℛ(C_1) is given below.

ℛ(C_1) = ((B718,1), (C12,1), (D0,2), (D311,1), (D312,2), (D321,1), (D322,1), (D329,1), ... (I46,3), (I461,2), (I462,1), (I463,3), ... (J26,1), (J6,1), (J61,7), (J71,1))


These clusters of books and the corresponding cluster descriptions can be used as follows: If a user is searching for books, say, on image segmentation (I46), then we select cluster C_1 because its representation alone contains the string I46. Books B_2 (Neurocomputing) and B_18 (Sensory Neural Networks: Lateral Inhibition) are both members of cluster C_1 even though their LCC numbers are quite different (B_2 is QA76.5.H4442, B_18 is QP363.3.N33).

Four additional books labeled B_101, B_102, B_103, and B_104 have been used to study the problem of assigning classification numbers to new books. The LCC numbers of these books are: (B_101) Q335.T39, (B_102) QA76.73.P356C57, (B_103) QA76.5.B76C.2, and (B_104) QA76.9D5W44. These books are assigned to clusters based on nearest neighbor classification. The nearest neighbor of B_101, a book on artificial intelligence, is B_23 and so B_101 is assigned to cluster C_1. It is observed that the assignment of these four books to the respective clusters is meaningful, demonstrating that knowledge-based clustering is useful in solving problems associated with document retrieval.

6.4 Data Mining

In recent years we have seen ever increasing volumes of collected data of all sorts. With so much data available, it is necessary to develop algorithms which can extract meaningful information from the vast stores. Searching for useful nuggets of information among huge amounts of data has become known as the field of data mining.

Data mining can be applied to relational, transaction, and spatial databases, as well as large stores of unstructured data such as the World Wide Web. There are many data mining systems in use today, and applications include the U.S. Treasury detecting money laundering, National Basketball Association coaches detecting trends and patterns of play for individual players and teams, and categorizing patterns of children in the foster care system [Hedberg 1996]. Several journals have had recent special issues on data mining [Cohen 1996, Cross 1996, Wah 1996].

6.4.1 Data Mining Approaches. Data mining, like clustering, is an exploratory activity, so clustering methods are well suited for data mining. Clustering is often an important initial step of several in the data mining process [Fayyad 1996]. Some of the data mining approaches which use clustering are database segmentation, predictive modeling, and visualization of large databases.

Segmentation. Clustering methods are used in data mining to segment databases into homogeneous groups. This can serve purposes of data compression (working with the clusters rather than individual items), or to identify characteristics of subpopulations which can be targeted for specific purposes (e.g., marketing aimed at senior citizens).

A continuous k-means clustering algorithm [Faber 1994] has been used to cluster pixels in Landsat images [Faber et al. 1994]. Each pixel originally has 7 values from different satellite bands, including infra-red. These 7 values are difficult for humans to assimilate and analyze without assistance. Pixels with the 7 feature values are clustered into 256 groups, then each pixel is assigned the value of the cluster centroid. The image can then be displayed with the spatial information intact. Human viewers can look at a single picture and identify a region of interest (e.g., highway or forest) and label it as a concept. The system then identifies other pixels in the same cluster as an instance of that concept.
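A compressed sketch of the band-vector quantization just described: pixels of a multiband image are clustered on their band values and each pixel is replaced by its cluster centroid. The cited work uses a "continuous" k-means variant; plain Lloyd iterations, the toy 7-band image, and the reduced number of groups used in the demo are assumptions for illustration.

import numpy as np

def quantize_pixels(bands, n_groups=256, iters=20, seed=0):
    """Cluster the pixels of an (H, W, B) multiband image into n_groups by
    k-means on their band vectors, then replace each pixel by its centroid."""
    H, W, B = bands.shape
    X = bands.reshape(-1, B).astype(float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_groups, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(n_groups):
            members = X[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return centers[labels].reshape(H, W, B), labels.reshape(H, W)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    image = rng.integers(0, 255, (32, 32, 7)).astype(float)  # toy 7-band image
    quantized, label_map = quantize_pixels(image, n_groups=16)
    print("distinct pixel values after quantization:", len(np.unique(label_map)))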
Predictive Modeling. Statistical methods of data analysis usually involve hypothesis testing of a model the analyst already has in mind. Data mining can aid the user in discovering potential hypotheses prior to using statistical tools.

Figure 33. A dendrogram corresponding to 100 books.

Predictive modeling uses clustering to group items, then infers rules to characterize the groups and suggest models. For example, magazine subscribers can be clustered based on a number of factors (age, sex, income, etc.), then the resulting groups characterized in an attempt to find a model which will distinguish those subscribers that will renew their subscriptions from those that will not [Simoudis 1996].

Visualization. Clusters in large databases can be used for visualization, in order to aid human analysts in identifying groups and subgroups that have similar characteristics. WinViz [Lee and Ong 1996] is a data mining visualization tool in which derived clusters can be exported as new attributes which can then be characterized by the system.


Figure 34. The seven smallest clusters found in the document set. These are stemmed words.

For example, breakfast cereals are clustered according to calories, protein, fat, sodium, fiber, carbohydrate, sugar, potassium, and vitamin content per serving. Upon seeing the resulting clusters, the user can export the clusters to WinViz as attributes. The system shows that one of the clusters is characterized by high potassium content, and the human analyst recognizes the individuals in the cluster as belonging to the "bran" cereal family, leading to a generalization that "bran cereals are high in potassium."

6.4.2 Mining Large Unstructured Databases. Data mining has often been performed on transaction and relational databases which have well-defined fields which can be used as features, but there has been recent research on large unstructured databases such as the World Wide Web [Etzioni 1996].

Examples of recent attempts to classify Web documents using words or functions of words as features include Maarek and Shaul [1996] and Chekuri et al. [1999]. However, relatively small sets of labeled training samples and very large dimensionality limit the ultimate success of automatic Web document categorization based on words as features.

Rather than grouping documents in a word feature space, Wulfekuhler and Punch [1997] cluster the words from a small collection of World Wide Web documents in the document space. The sample data set consisted of 85 documents from the manufacturing domain in 4 different user-defined categories (labor, legal, government, and design). These 85 documents contained 5190 distinct word stems after common words (the, and, of) were removed. Since the words are certainly not uncorrelated, they should fall into clusters where words used in a consistent way across the document set have similar values of frequency in each document.

K-means clustering was used to group the 5190 words into 10 groups. One surprising result was that an average of 92% of the words fell into a single cluster, which could then be discarded for data mining purposes. The smallest clusters contained terms which to a human seem semantically related. The 7 smallest clusters from a typical run are shown in Figure 34.
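A minimal sketch of clustering words in the document space: each word is represented by its frequency profile across the documents, and the words are grouped by k-means. The per-document normalization, the synthetic corpus, and the number of iterations are assumptions; the cited study's exact preprocessing may differ.

import numpy as np

def cluster_words(doc_word_counts, k=10, iters=30, seed=0):
    """Cluster words (not documents): rows of the (n_docs, n_words) count
    matrix are transposed so each word becomes a point in document space."""
    counts = np.asarray(doc_word_counts, dtype=float)
    # Normalize per document so long documents do not dominate (one
    # reasonable choice among several).
    X = (counts / (counts.sum(axis=1, keepdims=True) + 1e-9)).T
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == c].mean(0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    counts = rng.poisson(1.0, (85, 500))             # toy 85-document, 500-word corpus
    counts[:20, :25] += rng.poisson(5.0, (20, 25))   # 25 words used heavily in 20 documents
    labels = cluster_words(counts, k=10)
    sizes = np.bincount(labels, minlength=10)
    print("cluster sizes:", sorted(sizes, reverse=True))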
Terms which are used in ordinary contexts, or unique terms which do not occur often across the training document set, will tend to cluster into the


large 4000 member group. This takes care of spelling errors, proper names which are infrequent, and terms which are used in the same manner throughout the entire document set. Terms used in specific contexts (such as file in the context of filing a patent, rather than a computer file) will appear in the documents consistently with other terms appropriate to that context (patent, invent) and thus will tend to cluster together. Among the groups of words, unique contexts stand out from the crowd.

After discarding the largest cluster, the smaller set of features can be used to construct queries for seeking out other relevant documents on the Web using standard Web searching tools (e.g., Lycos, Alta Vista, Open Text). Searching the Web with terms taken from the word clusters allows discovery of finer grained topics (e.g., family medical leave) within the broadly defined categories (e.g., labor).

6.4.3 Data Mining in Geological Databases. Database mining is a critical resource in oil exploration and production. It is common knowledge in the oil industry that the typical cost of drilling a new offshore well is in the range of $30-40 million, but the chance of that site being an economic success is 1 in 10. More informed and systematic drilling decisions can significantly reduce overall production costs.

Advances in drilling technology and data collection methods have led to oil companies and their ancillaries collecting large amounts of geophysical/geological data from production wells and exploration sites, and then organizing them into large databases. Data mining techniques have recently been used to derive precise analytic relations between observed phenomena and parameters. These relations can then be used to quantify oil and gas reserves.

In qualitative terms, good recoverable reserves have high hydrocarbon saturation that is trapped by highly porous sediments (reservoir porosity) and surrounded by hard bulk rocks that prevent the hydrocarbon from leaking away. A large volume of porous sediments is crucial to finding good recoverable reserves; therefore, developing reliable and accurate methods for estimation of sediment porosities from the collected data is key to estimating hydrocarbon potential.

The general rule of thumb experts use for porosity computation is that it is a quasiexponential function of depth:

Porosity = K · e^(−F(x_1, x_2, ..., x_m) · Depth).   (4)

A number of factors such as rock types, structure, and cementation, as parameters of the function F, confound this relationship. This necessitates the definition of proper contexts in which to attempt discovery of porosity formulas. Geological contexts are expressed in terms of geological phenomena, such as geometry, lithology, compaction, and subsidence, associated with a region. It is well known that geological context changes from basin to basin (different geographical areas in the world) and also from region to region within a basin [Allen and Allen 1990; Biswas 1995]. Furthermore, the underlying features of contexts may vary greatly. Simple model matching techniques, which work in engineering domains where behavior is constrained by man-made systems and well-established laws of physics, may not apply in the hydrocarbon exploration domain. To address this, data clustering was used to identify the relevant contexts, and then equation discovery was carried out within each context. The goal was to derive the subset x_1, x_2, ..., x_m from a larger set of geological features, and the functional relationship F that best defined the porosity function in a region.

The overall methodology, illustrated in Figure 35, consists of two primary steps: (i) context definition using unsupervised clustering techniques, and (ii) equation discovery by regression analysis [Li and Biswas 1995]. Real exploration data collected from a region in the Alaska basin was analyzed using the methodology developed.


Figure 35. Description of the knowledge-based scientific discovery process.

The data objects (patterns) are described in terms of 37 geological features, such as porosity, permeability, grain size, density, and sorting, amount of different mineral fragments (e.g., quartz, chert, feldspar) present, nature of the rock fragments, pore characteristics, and cementation. All these feature values are numeric measurements made on samples obtained from well-logs during exploratory drilling processes.

The k-means clustering algorithm was used to identify a set of homogeneous primitive geological structures (g_1, g_2, ..., g_m). These primitives were then mapped onto the unit code versus stratigraphic unit map. Figure 36 depicts a partial mapping for a set of wells and four primitive structures. The next step in the discovery process identified sections of well regions that were made up of the same sequence of geological primitives. Every sequence defined a context C_i. From the partial mapping of Figure 36, the context C_1 = g_2 ∘ g_1 ∘ g_2 ∘ g_3 was identified in two well regions (the 300 and 600 series). After the contexts were defined, data points belonging to each context were grouped together for equation derivation. The derivation procedure employed multiple regression analysis [Sen and Srivastava 1990].

This method was applied to a data set of about 2600 objects corresponding to sample measurements collected from wells in the Alaskan Basin. The k-means algorithm clustered this data set into seven groups. As an illustration, we selected a set of 138 objects representing a context for further analysis. The features that best defined this cluster were selected, and experts surmised that the context represented a low porosity region, which was modeled using the regression procedure.
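A highly compressed sketch of the two-step methodology (context definition by clustering, then equation discovery by regression within each context). The real system derives contexts from sequences of geological primitives and selects a feature subset; this toy version collapses context definition into a single k-means grouping, fits only an effective exponent per context following Eq. (4), and uses entirely synthetic data and parameters.

import numpy as np

def fit_porosity_model(depth, porosity):
    """Fit porosity ~ K * exp(-a * depth) within one context by linear least
    squares on log(porosity); a plays the role of F(x_1, ..., x_m) in Eq. (4)."""
    A = np.c_[np.ones_like(depth), -depth]
    coef, *_ = np.linalg.lstsq(A, np.log(porosity), rcond=None)
    return float(np.exp(coef[0])), float(coef[1])

def contexts_then_regression(features, depth, porosity, n_contexts=3, seed=0):
    """Step (i): group samples into contexts by k-means on geological features.
    Step (ii): fit the exponential depth model within each context."""
    X = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_contexts, replace=False)]
    for _ in range(30):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == c].mean(0) if np.any(labels == c)
                            else centers[c] for c in range(n_contexts)])
    return {c: fit_porosity_model(depth[labels == c], porosity[labels == c])
            for c in range(n_contexts) if np.any(labels == c)}

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    n = 300
    features = rng.normal(0, 1, (n, 5))          # toy stand-ins for the 37 geological features
    context = (features[:, 0] > 0).astype(int)   # two latent contexts
    depth = rng.uniform(500, 3000, n)
    a_true = np.where(context == 0, 3e-4, 8e-4)  # different compaction rates per context
    porosity = 0.45 * np.exp(-a_true * depth) * np.exp(rng.normal(0, 0.05, n))
    for c, (K, a) in contexts_then_regression(features, depth, porosity, n_contexts=2).items():
        print(f"context {c}: K = {K:.3f}, exponent = {a:.2e}")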
7. SUMMARY

There are several applications where decision making and exploratory pattern analysis have to be performed on large data sets. For example, in document retrieval, a set of relevant documents has to be found among several millions of documents of dimensionality of more than 1000. It is possible to handle these problems if some useful abstraction of the data is obtained and is used in decision making, rather than directly using the entire data set. By data abstraction, we mean a simple and compact representation of the data. This simplicity helps the machine in efficient processing or a human in comprehending the structure in data easily. Clustering algorithms are ideally suited for achieving data abstraction.

In this paper, we have examined various steps in clustering: (1) pattern representation, (2) similarity computation, (3) grouping process, and (4) cluster representation. Also, we have discussed statistical, fuzzy, neural, evolutionary, and knowledge-based approaches to clustering.

Figure 36. Area code versus stratigraphic unit map for part of the studied region.

We have described four applications of clustering: (1) image segmentation, (2) object recognition, (3) document retrieval, and (4) data mining.

Clustering is a process of grouping data items based on a measure of similarity. Clustering is a subjective process; the same set of data items often needs to be partitioned differently for different applications. This subjectivity makes the process of clustering difficult. This is because a single algorithm or approach is not adequate to solve every clustering problem. A possible solution lies in reflecting this subjectivity in the form of knowledge. This knowledge is used either implicitly or explicitly in one or more phases of clustering. Knowledge-based clustering algorithms use domain knowledge explicitly.

The most challenging step in clustering is feature extraction or pattern representation. Pattern recognition researchers conveniently avoid this step by assuming that the pattern representations are available as input to the clustering algorithm. In small size data sets, pattern representations can be obtained based on previous experience of the user with the problem. However, in the case of large data sets, it is difficult for the user to keep track of the importance of each feature in clustering. A solution is to make as many measurements on the patterns as possible and use them in pattern representation. But it is not possible to use a large collection of measurements directly in clustering because of computational costs. So several feature extraction/selection approaches have been designed to obtain linear or nonlinear combinations of these measurements which can be used to represent patterns. Most of the schemes proposed for feature extraction/selection are typically iterative in nature and cannot be used on large data sets due to prohibitive computational costs.

The second step in clustering is similarity computation. A variety of schemes have been used to compute similarity between two patterns. They


use knowledge either implicitly or explicitly. Most of the knowledge-based clustering algorithms use explicit knowledge in similarity computation. However, if patterns are not represented using proper features, then it is not possible to get a meaningful partition irrespective of the quality and quantity of knowledge used in similarity computation. There is no universally acceptable scheme for computing similarity between patterns represented using a mixture of both qualitative and quantitative features. Dissimilarity between a pair of patterns is represented using a distance measure that may or may not be a metric.

The next step in clustering is the grouping step. There are broadly two grouping schemes: hierarchical and partitional schemes. The hierarchical schemes are more versatile, and the partitional schemes are less expensive. The partitional algorithms aim at minimizing the squared error criterion function. Motivated by the failure of the squared error partitional clustering algorithms in finding the optimal solution to this problem, a large collection of approaches have been proposed and used to obtain the global optimal solution to this problem. However, these schemes are computationally prohibitive on large data sets. ANN-based clustering schemes are neural implementations of the clustering algorithms, and they share the undesired properties of these algorithms. However, ANNs have the capability to automatically normalize the data and extract features. An important observation is that even if a scheme can find the optimal solution to the squared error partitioning problem, it may still fall short of the requirements because of the possible non-isotropic nature of the clusters.

In some applications, for example in document retrieval, it may be useful to have a clustering that is not a partition. This means clusters are overlapping. Fuzzy clustering and functional clustering are ideally suited for this purpose. Also, fuzzy clustering algorithms can handle mixed data types. However, a major problem with fuzzy clustering is that it is difficult to obtain the membership values. A general approach may not work because of the subjective nature of clustering. It is required to represent clusters obtained in a suitable form to help the decision maker. Knowledge-based clustering schemes generate intuitively appealing descriptions of clusters. They can be used even when the patterns are represented using a combination of qualitative and quantitative features, provided that knowledge linking a concept and the mixed features is available. However, implementations of the conceptual clustering schemes are computationally expensive and are not suitable for grouping large data sets.

The k-means algorithm and its neural implementation, the Kohonen net, are most successfully used on large data sets. This is because the k-means algorithm is simple to implement and computationally attractive because of its linear time complexity. However, it is not feasible to use even this linear time algorithm on large data sets. Incremental algorithms like leader and its neural implementation, the ART network, can be used to cluster large data sets. But they tend to be order-dependent. Divide and conquer is a heuristic that has been rightly exploited by computer algorithm designers to reduce computational costs. However, it should be judiciously used in clustering to achieve meaningful results.

In summary, clustering is an interesting, useful, and challenging problem. It has great potential in applications like object recognition, image segmentation, and information filtering and retrieval. However, it is possible to exploit this potential only after making several design choices carefully.

ACKNOWLEDGMENTS

The authors wish to acknowledge the generosity of several colleagues who


read manuscript drafts, made suggestions, and provided summaries of emerging application areas which we have incorporated into this paper. Gautam Biswas and Cen Li of Vanderbilt University provided the material on knowledge discovery in geological databases. Ana Fred of Instituto Superior Técnico in Lisbon, Portugal provided material on cluster analysis in the syntactic domain. William Punch and Marilyn Wulfekuhler of Michigan State University provided material on the application of cluster analysis to data mining problems. Scott Connell of Michigan State provided material describing his work on character recognition. Chitra Dorai of IBM T.J. Watson Research Center provided material on the use of clustering in 3D object recognition. Jianchang Mao of IBM Almaden Research Center, Peter Bajcsy of the University of Illinois, and Zoran Obradović of Washington State University also provided many helpful comments. Mario de Figueirido performed a meticulous reading of the manuscript and provided many helpful suggestions.

This work was supported by the National Science Foundation under grant INT-9321584.

REFERENCES

AARTS, E. AND KORST, J. 1989. Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing. Wiley-Interscience series in discrete mathematics and optimization. John Wiley and Sons, Inc., New York, NY.
ACM, 1994. ACM CR Classifications. ACM Computing Surveys 35, 5–16.
AL-SULTAN, K. S. 1995. A tabu search approach to clustering problems. Pattern Recogn. 28, 1443–1451.
AL-SULTAN, K. S. AND KHAN, M. M. 1996. Computational experience on four algorithms for the hard clustering problem. Pattern Recogn. Lett. 17, 3, 295–308.
ALLEN, P. A. AND ALLEN, J. R. 1990. Basin Analysis: Principles and Applications. Blackwell Scientific Publications, Inc., Cambridge, MA.
ALTA VISTA, 1999. http://altavista.digital.com.
AMADASUN, M. AND KING, R. A. 1988. Low-level segmentation of multispectral images via agglomerative clustering of uniform neighbourhoods. Pattern Recogn. 21, 3 (1988), 261–268.
ANDERBERG, M. R. 1973. Cluster Analysis for Applications. Academic Press, Inc., New York, NY.
AUGUSTSON, J. G. AND MINKER, J. 1970. An analysis of some graph theoretical clustering techniques. J. ACM 17, 4 (Oct. 1970), 571–588.
BABU, G. P. AND MURTY, M. N. 1993. A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm. Pattern Recogn. Lett. 14, 10 (Oct. 1993), 763–769.
BABU, G. P. AND MURTY, M. N. 1994. Clustering with evolution strategies. Pattern Recogn. 27, 321–329.
BABU, G. P., MURTY, M. N., AND KEERTHI, S. S. 2000. Stochastic connectionist approach for pattern clustering (To appear). IEEE Trans. Syst. Man Cybern.
BACKER, F. B. AND HUBERT, L. J. 1976. A graph-theoretic approach to goodness-of-fit in complete-link hierarchical clustering. J. Am. Stat. Assoc. 71, 870–878.
BACKER, E. 1995. Computer-Assisted Reasoning in Cluster Analysis. Prentice Hall International (UK) Ltd., Hertfordshire, UK.
BAEZA-YATES, R. A. 1992. Introduction to data structures and algorithms related to information retrieval. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice-Hall, Inc., Upper Saddle River, NJ, 13–27.
BAJCSY, P. 1997. Hierarchical segmentation and clustering using similarity analysis. Ph.D. Dissertation. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL.
BALL, G. H. AND HALL, D. J. 1965. ISODATA, a novel method of data analysis and classification. Tech. Rep. Stanford University, Stanford, CA.
BENTLEY, J. L. AND FRIEDMAN, J. H. 1978. Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Trans. Comput. C-27, 6 (June), 97–105.
BEZDEK, J. C. 1981. Pattern Recognition With Fuzzy Objective Function Algorithms. Plenum Press, New York, NY.
BHUYAN, J. N., RAGHAVAN, V. V., AND VENKATESH, K. E. 1991. Genetic algorithm for clustering with an ordered representation. In Proceedings of the Fourth International Conference on Genetic Algorithms, 408–415.
BISWAS, G., WEINBERG, J., AND LI, C. 1995. A Conceptual Clustering Method for Knowledge Discovery in Databases. Editions Technip.
BRAILOVSKY, V. L. 1991. A probabilistic approach to clustering. Pattern Recogn. Lett. 12, 4 (Apr. 1991), 193–198.

ACM Computing Surveys, Vol. 31, No. 3, September 1999


318 • A. Jain et al.

BRODATZ, P. 1966. Textures: A Photographic Al- DIDAY, E. 1973. The dynamic cluster method in
bum for Artists and Designers. Dover Publi- non-hierarchical clustering. J. Comput. Inf.
cations, Inc., Mineola, NY. Sci. 2, 61– 88.
CAN, F. 1993. Incremental clustering for dy- DIDAY, E. AND SIMON, J. C. 1976. Clustering
namic information processing. ACM Trans. analysis. In Digital Pattern Recognition, K.
Inf. Syst. 11, 2 (Apr. 1993), 143–164. S. Fu, Ed. Springer-Verlag, Secaucus, NJ,
CARPENTER, G. AND GROSSBERG, S. 1990. ART3: 47–94.
Hierarchical search using chemical transmit- DIDAY, E. 1988. The symbolic approach in clus-
ters in self-organizing pattern recognition ar- tering. In Classification and Related Meth-
chitectures. Neural Networks 3, 129 –152. ods, H. H. Bock, Ed. North-Holland Pub-
CHEKURI, C., GOLDWASSER, M. H., RAGHAVAN, P., lishing Co., Amsterdam, The Netherlands.
AND UPFAL, E. 1997. Web search using au- DORAI, C. AND JAIN, A. K. 1995. Shape spectra
tomatic classification. In Proceedings of the based view grouping for free-form objects. In
Sixth International Conference on the World Proceedings of the International Conference on
Wide Web (Santa Clara, CA, Apr.), http:// Image Processing (ICIP-95), 240 –243.
theory.stanford.edu/people/wass/publications/ DUBES, R. C. AND JAIN, A. K. 1976. Clustering
Web Search/Web Search.html. techniques: The user’s dilemma. Pattern
CHENG, C. H. 1995. A branch-and-bound clus- Recogn. 8, 247–260.
tering algorithm. IEEE Trans. Syst. Man DUBES, R. C. AND JAIN, A. K. 1980. Clustering
Cybern. 25, 895– 898. methodology in exploratory data analysis. In
CHENG, Y. AND FU, K. S. 1985. Conceptual clus- Advances in Computers, M. C. Yovits,, Ed.
tering in knowledge organization. IEEE Academic Press, Inc., New York, NY, 113–
Trans. Pattern Anal. Mach. Intell. 7, 592–598. 125.
CHENG, Y. 1995. Mean shift, mode seeking, and DUBES, R. C. 1987. How many clusters are
clustering. IEEE Trans. Pattern Anal. Mach. best?—an experiment. Pattern Recogn. 20, 6
Intell. 17, 7 (July), 790 –799. (Nov. 1, 1987), 645– 663.
CHIEN, Y. T. 1978. Interactive Pattern Recogni- DUBES, R. C. 1993. Cluster analysis and related
tion. Marcel Dekker, Inc., New York, NY. issues. In Handbook of Pattern Recognition
& Computer Vision, C. H. Chen, L. F. Pau,
CHOUDHURY, S. AND MURTY, M. N. 1990. A divi-
and P. S. P. Wang, Eds. World Scientific
sive scheme for constructing minimal span-
Publishing Co., Inc., River Edge, NJ, 3–32.
ning trees in coordinate space. Pattern
DUBUISSON, M. P. AND JAIN, A. K. 1994. A mod-
Recogn. Lett. 11, 6 (Jun. 1990), 385–389.
ified Hausdorff distance for object matchin-
1996. Special issue on data mining. Commun.
g. In Proceedings of the International Con-
ACM 39, 11.
ference on Pattern Recognition (ICPR
COLEMAN, G. B. AND ANDREWS, H.
’94), 566 –568.
C. 1979. Image segmentation by cluster- DUDA, R. O. AND HART, P. E. 1973. Pattern
ing. Proc. IEEE 67, 5, 773–785. Classification and Scene Analysis. John
CONNELL, S. AND JAIN, A. K. 1998. Learning Wiley and Sons, Inc., New York, NY.
prototypes for on-line handwritten digits. In DUNN, S., JANOS, L., AND ROSENFELD, A. 1983.
Proceedings of the 14th International Confer- Bimean clustering. Pattern Recogn. Lett. 1,
ence on Pattern Recognition (Brisbane, Aus- 169 –173.
tralia, Aug.), 182–184. DURAN, B. S. AND ODELL, P. L. 1974. Cluster
CROSS, S. E., Ed. 1996. Special issue on data Analysis: A Survey. Springer-Verlag, New
mining. IEEE Expert 11, 5 (Oct.). York, NY.
DALE, M. B. 1985. On the comparison of con- EDDY, W. F., MOCKUS, A., AND OUE, S. 1996.
ceptual clustering and numerical taxono- Approximate single linkage cluster analysis of
my. IEEE Trans. Pattern Anal. Mach. Intell. large data sets in high-dimensional spaces.
7, 241–244. Comput. Stat. Data Anal. 23, 1, 29 – 43.
DAVE, R. N. 1992. Generalized fuzzy C-shells ETZIONI, O. 1996. The World-Wide Web: quag-
clustering and detection of circular and ellip- mire or gold mine? Commun. ACM 39, 11,
tic boundaries. Pattern Recogn. 25, 713–722. 65– 68.
DAVIS, T., Ed. 1991. The Handbook of Genetic EVERITT, B. S. 1993. Cluster Analysis. Edward
Algorithms. Van Nostrand Reinhold Co., Arnold, Ltd., London, UK.
New York, NY. FABER, V. 1994. Clustering and the continuous
DAY, W. H. E. 1992. Complexity theory: An in- k-means algorithm. Los Alamos Science 22,
troduction for practitioners of classifica- 138 –144.
tion. In Clustering and Classification, P. FABER, V., HOCHBERG, J. C., KELLY, P. M., THOMAS,
Arabie and L. Hubert, Eds. World Scientific T. R., AND WHITE, J. M. 1994. Concept ex-
Publishing Co., Inc., River Edge, NJ. traction: A data-mining technique. Los
DEMPSTER, A. P., LAIRD, N. M., AND RUBIN, D. Alamos Science 22, 122–149.
B. 1977. Maximum likelihood from incom- FAYYAD, U. M. 1996. Data mining and knowl-
plete data via the EM algorithm. J. Royal edge discovery: Making sense out of data.
Stat. Soc. B. 39, 1, 1–38. IEEE Expert 11, 5 (Oct.), 20 –25.

ACM Computing Surveys, Vol. 31, No. 3, September 1999


Data Clustering • 319

FISHER, D. AND LANGLEY, P. 1986. Conceptual GORDON, A. D. AND HENDERSON, J. T. 1977.


clustering and its relation to numerical tax- Algorithm for Euclidean sum of squares.
onomy. In Artificial Intelligence and Statis- Biometrics 33, 355–362.
tics, A W. Gale, Ed. Addison-Wesley Long- GOTLIEB, G. C. AND KUMAR, S. 1968. Semantic
man Publ. Co., Inc., Reading, MA, 77–116. clustering of index terms. J. ACM 15, 493–
FISHER, D. 1987. Knowledge acquisition via in- 513.
cremental conceptual clustering. Mach. GOWDA, K. C. 1984. A feature reduction and
Learn. 2, 139 –172. unsupervised classification algorithm for mul-
FISHER, D., XU, L., CARNES, R., RICH, Y., FENVES, S. tispectral data. Pattern Recogn. 17, 6, 667–
J., CHEN, J., SHIAVI, R., BISWAS, G., AND WEIN- 676.
BERG, J. 1993. Applying AI clustering to GOWDA, K. C. AND KRISHNA, G. 1977.
engineering tasks. IEEE Expert 8, 51– 60. Agglomerative clustering using the concept of
FISHER, L. AND VAN NESS, J. W. 1971. mutual nearest neighborhood. Pattern
Admissible clustering procedures. Biometrika Recogn. 10, 105–112.
58, 91–104. GOWDA, K. C. AND DIDAY, E. 1992. Symbolic
FLYNN, P. J. AND JAIN, A. K. 1991. BONSAI: 3D clustering using a new dissimilarity mea-
object recognition using constrained search. sure. IEEE Trans. Syst. Man Cybern. 22,
IEEE Trans. Pattern Anal. Mach. Intell. 13, 368 –378.
10 (Oct. 1991), 1066 –1075. GOWER, J. C. AND ROSS, G. J. S. 1969. Minimum
FOGEL, D. B. AND SIMPSON, P. K. 1993. Evolving spanning rees and single-linkage cluster
fuzzy clusters. In Proceedings of the Interna- analysis. Appl. Stat. 18, 54 – 64.
tional Conference on Neural Networks (San GREFENSTETTE, J 1986. Optimization of control
Francisco, CA), 1829 –1834. parameters for genetic algorithms. IEEE
FOGEL, D. B. AND FOGEL, L. J., Eds. 1994. Spe- Trans. Syst. Man Cybern. SMC-16, 1 (Jan./
cial issue on evolutionary computation. Feb. 1986), 122–128.
IEEE Trans. Neural Netw. (Jan.). HARALICK, R. M. AND KELLY, G. L. 1969.
FOGEL, L. J., OWENS, A. J., AND WALSH, M. Pattern recognition with measurement space
and spatial clustering for multiple images.
J. 1965. Artificial Intelligence Through
Proc. IEEE 57, 4, 654 – 665.
Simulated Evolution. John Wiley and Sons,
HARTIGAN, J. A. 1975. Clustering Algorithms.
Inc., New York, NY.
John Wiley and Sons, Inc., New York, NY.
FRAKES, W. B. AND BAEZA-YATES, R., Eds.
HEDBERG, S. 1996. Searching for the mother
1992. Information Retrieval: Data Struc-
lode: Tales of the first data miners. IEEE
tures and Algorithms. Prentice-Hall, Inc.,
Expert 11, 5 (Oct.), 4 –7.
Upper Saddle River, NJ.
HERTZ, J., KROGH, A., AND PALMER, R. G. 1991.
FRED, A. L. N. AND LEITAO, J. M. N. 1996. A
Introduction to the Theory of Neural Compu-
minimum code length technique for clustering tation. Santa Fe Institute Studies in the Sci-
of syntactic patterns. In Proceedings of the ences of Complexity lecture notes. Addison-
International Conference on Pattern Recogni- Wesley Longman Publ. Co., Inc., Reading,
tion (Vienna, Austria), 680 – 684. MA.
FRED, A. L. N. 1996. Clustering of sequences HOFFMAN, R. AND JAIN, A. K. 1987.
using a minimum grammar complexity crite- Segmentation and classification of range im-
rion. In Grammatical Inference: Learning ages. IEEE Trans. Pattern Anal. Mach. In-
Syntax from Sentences, L. Miclet and C. tell. PAMI-9, 5 (Sept. 1987), 608 – 620.
Higuera, Eds. Springer-Verlag, Secaucus, HOFMANN, T. AND BUHMANN, J. 1997. Pairwise
NJ, 107–116. data clustering by deterministic annealing.
FU, K. S. AND LU, S. Y. 1977. A clustering pro- IEEE Trans. Pattern Anal. Mach. Intell. 19, 1
cedure for syntactic patterns. IEEE Trans. (Jan.), 1–14.
Syst. Man Cybern. 7, 734 –742. HOFMANN, T., PUZICHA, J., AND BUCHMANN, J.
FU, K. S. AND MUI, J. K. 1981. A survey on M. 1998. Unsupervised texture segmenta-
image segmentation. Pattern Recogn. 13, tion in a deterministic annealing framework.
3–16. IEEE Trans. Pattern Anal. Mach. Intell. 20, 8,
FUKUNAGA, K. 1990. Introduction to Statistical 803– 818.
Pattern Recognition. 2nd ed. Academic HOLLAND, J. H. 1975. Adaption in Natural and
Press Prof., Inc., San Diego, CA. Artificial Systems. University of Michigan
GLOVER, F. 1986. Future paths for integer pro- Press, Ann Arbor, MI.
gramming and links to artificial intelligence. HOOVER, A., JEAN-BAPTISTE, G., JIANG, X., FLYNN,
Comput. Oper. Res. 13, 5 (May 1986), 533– P. J., BUNKE, H., GOLDGOF, D. B., BOWYER, K.,
549. EGGERT, D. W., FITZGIBBON, A., AND FISHER, R.
GOLDBERG, D. E. 1989. Genetic Algorithms in B. 1996. An experimental comparison of
Search, Optimization and Machine Learning. range image segmentation algorithms. IEEE
Addison-Wesley Publishing Co., Inc., Red- Trans. Pattern Anal. Mach. Intell. 18, 7, 673–
wood City, CA. 689.

ACM Computing Surveys, Vol. 31, No. 3, September 1999


320 • A. Jain et al.

HUTTENLOCHER, D. P., KLANDERMAN, G. A., AND JONES, D. AND BELTRAMO, M. A. 1991. Solving
RUCKLIDGE, W. J. 1993. Comparing images partitioning problems with genetic algorithms.
using the Hausdorff distance. IEEE Trans. In Proceedings of the Fourth International
Pattern Anal. Mach. Intell. 15, 9, 850 – 863. Conference on Genetic Algorithms, 442– 449.
ICHINO, M. AND YAGUCHI, H. 1994. Generalized JUDD, D., MCKINLEY, P., AND JAIN, A. K.
Minkowski metrics for mixed feature-type 1996. Large-scale parallel data clustering.
data analysis. IEEE Trans. Syst. Man Cy- In Proceedings of the International Conference
bern. 24, 698 –708. on Pattern Recognition (Vienna, Aus-
1991. Proceedings of the International Joint Con- tria), 488 – 493.
ference on Neural Networks. (IJCNN’91). KING, B. 1967. Step-wise clustering proce-
1992. Proceedings of the International Joint Con- dures. J. Am. Stat. Assoc. 69, 86 –101.
ference on Neural Networks. KIRKPATRICK, S., GELATT, C. D., JR., AND VECCHI,
ISMAIL, M. A. AND KAMEL, M. S. 1989. M. P. 1983. Optimization by simulated an-
Multidimensional data clustering utilizing nealing. Science 220, 4598 (May), 671– 680.
hybrid search strategies. Pattern Recogn. 22, KLEIN, R. W. AND DUBES, R. C. 1989.
1 (Jan. 1989), 75– 89. Experiments in projection and clustering by
JAIN, A. K. AND DUBES, R. C. 1988. Algorithms simulated annealing. Pattern Recogn. 22,
for Clustering Data. Prentice-Hall advanced 213–220.
reference series. Prentice-Hall, Inc., Upper KNUTH, D. 1973. The Art of Computer Program-
Saddle River, NJ. ming. Addison-Wesley, Reading, MA.
JAIN, A. K. AND FARROKHNIA, F. 1991. KOONTZ, W. L. G., FUKUNAGA, K., AND NARENDRA,
Unsupervised texture segmentation using Ga- P. M. 1975. A branch and bound clustering
bor filters. Pattern Recogn. 24, 12 (Dec. algorithm. IEEE Trans. Comput. 23, 908 –
1991), 1167–1186. 914.
JAIN, A. K. AND BHATTACHARJEE, S. 1992. Text KOHONEN, T. 1989. Self-Organization and Asso-
segmentation using Gabor filters for auto- ciative Memory. 3rd ed. Springer informa-
matic document processing. Mach. Vision tion sciences series. Springer-Verlag, New
Appl. 5, 3 (Summer 1992), 169 –184. York, NY.
JAIN, A. J. AND FLYNN, P. J., Eds. 1993. Three KRAAIJVELD, M., MAO, J., AND JAIN, A. K. 1995.
Dimensional Object Recognition Systems. A non-linear projection method based on Ko-
Elsevier Science Inc., New York, NY. honen’s topology preserving maps. IEEE
JAIN, A. K. AND MAO, J. 1994. Neural networks Trans. Neural Netw. 6, 548 –559.
and pattern recognition. In Computational KRISHNAPURAM, R., FRIGUI, H., AND NASRAOUI, O.
Intelligence: Imitating Life, J. M. Zurada, R. 1995. Fuzzy and probabilistic shell cluster-
J. Marks, and C. J. Robinson, Eds. 194 – ing algorithms and their application to bound-
212. ary detection and surface approximation.
JAIN, A. K. AND FLYNN, P. J. 1996. Image seg- IEEE Trans. Fuzzy Systems 3, 29 – 60.
mentation using clustering. In Advances in KURITA, T. 1991. An efficient agglomerative
Image Understanding: A Festschrift for Azriel clustering algorithm using a heap. Pattern
Rosenfeld, N. Ahuja and K. Bowyer, Eds, Recogn. 24, 3 (1991), 205–209.
IEEE Press, Piscataway, NJ, 65– 83. LIBRARY OF CONGRESS, 1990. LC classification
JAIN, A. K. AND MAO, J. 1996. Artificial neural outline. Library of Congress, Washington,
networks: A tutorial. IEEE Computer 29 DC.
(Mar.), 31– 44. LEBOWITZ, M. 1987. Experiments with incre-
JAIN, A. K., RATHA, N. K., AND LAKSHMANAN, S. mental concept formation. Mach. Learn. 2,
1997. Object detection using Gabor filters. 103–138.
Pattern Recogn. 30, 2, 295–309. LEE, H.-Y. AND ONG, H.-L. 1996. Visualization
JAIN, N. C., INDRAYAN, A., AND GOEL, L. R. support for data mining. IEEE Expert 11, 5
1986. Monte Carlo comparison of six hierar- (Oct.), 69 –75.
chical clustering methods on random data. LEE, R. C. T., SLAGLE, J. R., AND MONG, C. T.
Pattern Recogn. 19, 1 (Jan./Feb. 1986), 95–99. 1978. Towards automatic auditing of
JAIN, R., KASTURI, R., AND SCHUNCK, B. G. records. IEEE Trans. Softw. Eng. 4, 441–
1995. Machine Vision. McGraw-Hill series 448.
in computer science. McGraw-Hill, Inc., New LEE, R. C. T. 1981. Cluster analysis and its
York, NY. applications. In Advances in Information
JARVIS, R. A. AND PATRICK, E. A. 1973. Systems Science, J. T. Tou, Ed. Plenum
Clustering using a similarity method based on Press, New York, NY.
shared near neighbors. IEEE Trans. Com- LI, C. AND BISWAS, G. 1995. Knowledge-based
put. C-22, 8 (Aug.), 1025–1034. scientific discovery in geological databases.
JOLION, J.-M., MEER, P., AND BATAOUCHE, S. In Proceedings of the First International Con-
1991. Robust clustering with applications in ference on Knowledge Discovery and Data
computer vision. IEEE Trans. Pattern Anal. Mining (Montreal, Canada, Aug. 20-21),
Mach. Intell. 13, 8 (Aug. 1991), 791– 802. 204 –209.

ACM Computing Surveys, Vol. 31, No. 3, September 1999


Data Clustering • 321

LU, S. Y. AND FU, K. S. 1978. A sentence-to- MURTY, M. N. AND KRISHNA, G. 1980. A compu-
sentence clustering procedure for pattern tationally efficient technique for data cluster-
analysis. IEEE Trans. Syst. Man Cybern. 8, ing. Pattern Recogn. 12, 153–158.
381–389. MURTY, M. N. AND JAIN, A. K. 1995. Knowledge-
LUNDERVOLD, A., FENSTAD, A. M., ERSLAND, L., AND based clustering scheme for collection man-
TAXT, T. 1996. Brain tissue volumes from agement and retrieval of library books. Pat-
multispectral 3D MRI: A comparative study of tern Recogn. 28, 949 –964.
four classifiers. In Proceedings of the Confer- NAGY, G. 1968. State of the art in pattern rec-
ence of the Society on Magnetic Resonance, ognition. Proc. IEEE 56, 836 – 862.
MAAREK, Y. S. AND BEN SHAUL, I. Z. 1996. NG, R. AND HAN, J. 1994. Very large data bases.
Automatically organizing bookmarks per con- In Proceedings of the 20th International Con-
tents. In Proceedings of the Fifth Interna- ference on Very Large Data Bases (VLDB’94,
tional Conference on the World Wide Web Santiago, Chile, Sept.), VLDB Endowment,
(Paris, May), http://www5conf.inria.fr/fich- Berkeley, CA, 144 –155.
html/paper-sessions.html. NGUYEN, H. H. AND COHEN, P. 1993. Gibbs ran-
MCQUEEN, J. 1967. Some methods for classifi- dom fields, fuzzy clustering, and the unsuper-
cation and analysis of multivariate observa- vised segmentation of textured images. CV-
tions. In Proceedings of the Fifth Berkeley GIP: Graph. Models Image Process. 55, 1 (Jan.
Symposium on Mathematical Statistics and 1993), 1–19.
Probability, 281–297. OEHLER, K. L. AND GRAY, R. M. 1995.
MAO, J. AND JAIN, A. K. 1992. Texture classifi- Combining image compression and classifica-
cation and segmentation using multiresolu- tion using vector quantization. IEEE Trans.
tion simultaneous autoregressive models. Pattern Anal. Mach. Intell. 17, 461– 473.
Pattern Recogn. 25, 2 (Feb. 1992), 173–188. OJA, E. 1982. A simplified neuron model as a
MAO, J. AND JAIN, A. K. 1995. Artificial neural principal component analyzer. Bull. Math.
networks for feature extraction and multivari- Bio. 15, 267–273.
ate data projection. IEEE Trans. Neural OZAWA, K. 1985. A stratificational overlapping
Netw. 6, 296 –317. cluster scheme. Pattern Recogn. 18, 279 –286.
MAO, J. AND JAIN, A. K. 1996. A self-organizing OPEN TEXT, 1999. http://index.opentext.net.
network for hyperellipsoidal clustering (HEC). KAMGAR-PARSI, B., GUALTIERI, J. A., DEVANEY, J. A.,
IEEE Trans. Neural Netw. 7, 16 –29. AND KAMGAR-PARSI, K. 1990. Clustering with
MEVINS, A. J. 1995. A branch and bound incre- neural networks. Biol. Cybern. 63, 201–208.
mental conceptual clusterer. Mach. Learn. LYCOS, 1999. http://www.lycos.com.
18, 5–22. PAL, N. R., BEZDEK, J. C., AND TSAO, E. C.-K.
MICHALSKI, R., STEPP, R. E., AND DIDAY, E. 1993. Generalized clustering networks and
1981. A recent advance in data analysis: Kohonen’s self-organizing scheme. IEEE
Clustering objects into classes characterized Trans. Neural Netw. 4, 549 –557.
by conjunctive concepts. In Progress in Pat- QUINLAN, J. R. 1990. Decision trees and deci-
tern Recognition, Vol. 1, L. Kanal and A. sion making. IEEE Trans. Syst. Man Cy-
Rosenfeld, Eds. North-Holland Publishing bern. 20, 339 –346.
Co., Amsterdam, The Netherlands. RAGHAVAN, V. V. AND BIRCHAND, K. 1979. A
MICHALSKI, R., STEPP, R. E., AND DIDAY, clustering strategy based on a formalism of
E. 1983. Automated construction of classi- the reproductive process in a natural system.
fications: conceptual clustering versus numer- In Proceedings of the Second International
ical taxonomy. IEEE Trans. Pattern Anal. Conference on Information Storage and Re-
Mach. Intell. PAMI-5, 5 (Sept.), 396 – 409. trieval, 10 –22.
MISHRA, S. K. AND RAGHAVAN, V. V. 1994. An RAGHAVAN, V. V. AND YU, C. T. 1981. A compar-
empirical study of the performance of heuris- ison of the stability characteristics of some
tic methods for clustering. In Pattern Recog- graph theoretic clustering methods. IEEE
nition in Practice, E. S. Gelsema and L. N. Trans. Pattern Anal. Mach. Intell. 3, 393– 402.
Kanal, Eds. 425– 436. RASMUSSEN, E. 1992. Clustering algorithms.
MITCHELL, T. 1997. Machine Learning. McGraw- In Information Retrieval: Data Structures and
Hill, Inc., New York, NY. Algorithms, W. B. Frakes and R. Baeza-Yates,
MOHIUDDIN, K. M. AND MAO, J. 1994. A compar- Eds. Prentice-Hall, Inc., Upper Saddle
ative study of different classifiers for hand- River, NJ, 419 – 442.
printed character recognition. In Pattern RICH, E. 1983. Artificial Intelligence. McGraw-
Recognition in Practice, E. S. Gelsema and L. Hill, Inc., New York, NY.
N. Kanal, Eds. 437– 448. RIPLEY, B. D., Ed. 1989. Statistical Inference
MOOR, B. K. 1988. ART 1 and Pattern Cluster- for Spatial Processes. Cambridge University
ing. In 1988 Connectionist Summer School, Press, New York, NY.
Morgan Kaufmann, San Mateo, CA, 174 –185. ROSE, K., GUREWITZ, E., AND FOX, G. C. 1993.
MURTAGH, F. 1984. A survey of recent advances Deterministic annealing approach to con-
in hierarchical clustering algorithms which strained clustering. IEEE Trans. Pattern
use cluster centers. Comput. J. 26, 354 –359. Anal. Mach. Intell. 15, 785–794.

ACM Computing Surveys, Vol. 31, No. 3, September 1999


322 • A. Jain et al.

ROSENFELD, A. AND KAK, A. C. 1982. Digital Pic- SPATH, H. 1980. Cluster Analysis Algorithms
ture Processing. 2nd ed. Academic Press, for Data Reduction and Classification. Ellis
Inc., New York, NY. Horwood, Upper Saddle River, NJ.
ROSENFELD, A., SCHNEIDER, V. B., AND HUANG, M. SOLBERG, A., TAXT, T., AND JAIN, A. 1996. A
K. 1969. An application of cluster detection Markov random field model for classification
to text and picture processing. IEEE Trans. of multisource satellite imagery. IEEE
Inf. Theor. 15, 6, 672– 681. Trans. Geoscience and Remote Sensing 34, 1,
ROSS, G. J. S. 1968. Classification techniques 100 –113.
for large sets of data. In Numerical Taxon- SRIVASTAVA, A. AND MURTY, M. N 1990. A com-
omy, A. J. Cole, Ed. Academic Press, Inc., parison between conceptual clustering and
New York, NY. conventional clustering. Pattern Recogn. 23,
RUSPINI, E. H. 1969. A new approach to cluster- 9 (1990), 975–981.
ing. Inf. Control 15, 22–32. STAHL, H. 1986. Cluster analysis of large data
SALTON, G. 1991. Developments in automatic sets. In Classification as a Tool of Research,
text retrieval. Science 253, 974 –980.
W. Gaul and M. Schader, Eds. Elsevier
SAMAL, A. AND IYENGAR, P. A. 1992. Automatic
North-Holland, Inc., New York, NY, 423– 430.
recognition and analysis of human faces and
STEPP, R. E. AND MICHALSKI, R. S. 1986.
facial expressions: A survey. Pattern Recogn.
Conceptual clustering of structured objects: A
25, 1 (Jan. 1992), 65–77.
SAMMON, J. W. JR. 1969. A nonlinear mapping goal-oriented approach. Artif. Intell. 28, 1
for data structure analysis. IEEE Trans. (Feb. 1986), 43– 69.
Comput. 18, 401– 409. SUTTON, M., STARK, L., AND BOWYER, K.
SANGAL, R. 1991. Programming Paradigms in 1993. Function-based generic recognition for
LISP. McGraw-Hill, Inc., New York, NY. multiple object categories. In Three-Dimen-
SCHACHTER, B. J., DAVIS, L. S., AND ROSENFELD, sional Object Recognition Systems, A. Jain
A. 1979. Some experiments in image seg- and P. J. Flynn, Eds. Elsevier Science Inc.,
mentation by clustering of local feature val- New York, NY.
ues. Pattern Recogn. 11, 19 –28. SYMON, M. J. 1977. Clustering criterion and
SCHWEFEL, H. P. 1981. Numerical Optimization multi-variate normal mixture. Biometrics
of Computer Models. John Wiley and Sons, 77, 35– 43.
Inc., New York, NY. TANAKA, E. 1995. Theoretical aspects of syntac-
SELIM, S. Z. AND ISMAIL, M. A. 1984. K-means- tic pattern recognition. Pattern Recogn. 28,
type algorithms: A generalized convergence 1053–1061.
theorem and characterization of local opti- TAXT, T. AND LUNDERVOLD, A. 1994. Multi-
mality. IEEE Trans. Pattern Anal. Mach. In- spectral analysis of the brain using magnetic
tell. 6, 81– 87. resonance imaging. IEEE Trans. Medical
SELIM, S. Z. AND ALSULTAN, K. 1991. A simu- Imaging 13, 3, 470 – 481.
lated annealing algorithm for the clustering TITTERINGTON, D. M., SMITH, A. F. M., AND MAKOV,
problem. Pattern Recogn. 24, 10 (1991), U. E. 1985. Statistical Analysis of Finite
1003–1008. Mixture Distributions. John Wiley and Sons,
SEN, A. AND SRIVASTAVA, M. 1990. Regression Inc., New York, NY.
Analysis. Springer-Verlag, New York, NY. TOUSSAINT, G. T. 1980. The relative neighbor-
SETHI, I. AND JAIN, A. K., Eds. 1991. Artificial hood graph of a finite planar set. Pattern
Neural Networks and Pattern Recognition: Recogn. 12, 261–268.
Old and New Connections. Elsevier Science TRIER, O. D. AND JAIN, A. K. 1995. Goal-
Inc., New York, NY.
directed evaluation of binarization methods.
SHEKAR, B., MURTY, N. M., AND KRISHNA, G.
IEEE Trans. Pattern Anal. Mach. Intell. 17,
1987. A knowledge-based clustering scheme.
1191–1201.
Pattern Recogn. Lett. 5, 4 (Apr. 1, 1987), 253–
UCHIYAMA, T. AND ARBIB, M. A. 1994. Color image
259.
SILVERMAN, J. F. AND COOPER, D. B. 1988. segmentation using competitive learning.
Bayesian clustering for unsupervised estima- IEEE Trans. Pattern Anal. Mach. Intell. 16, 12
tion of surface and texture models. (Dec. 1994), 1197–1206.
IEEE Trans. Pattern Anal. Mach. Intell. 10, 4 URQUHART, R. B. 1982. Graph theoretical clus-
(July 1988), 482– 495. tering based on limited neighborhood
SIMOUDIS, E. 1996. Reality check for data min- sets. Pattern Recogn. 15, 173–187.
ing. IEEE Expert 11, 5 (Oct.), 26 –33. VENKATESWARLU, N. B. AND RAJU, P. S. V. S. K.
SLAGLE, J. R., CHANG, C. L., AND HELLER, S. R. 1992. Fast ISODATA clustering algorithms.
1975. A clustering and data-reorganizing al- Pattern Recogn. 25, 3 (Mar. 1992), 335–342.
gorithm. IEEE Trans. Syst. Man Cybern. 5, VINOD, V. V., CHAUDHURY, S., MUKHERJEE, J., AND
125–128. GHOSE, S. 1994. A connectionist approach
SNEATH, P. H. A. AND SOKAL, R. R. 1973. for clustering with applications in image
Numerical Taxonomy. Freeman, London, analysis. IEEE Trans. Syst. Man Cybern. 24,
UK. 365–384.

ACM Computing Surveys, Vol. 31, No. 3, September 1999


Data Clustering • 323

WAH, B. W., Ed. 1996. Special section on min- WULFEKUHLER, M. AND PUNCH, W. 1997. Finding
ing of databases. IEEE Trans. Knowl. Data salient features for personal web page categories.
Eng. (Dec.). In Proceedings of the Sixth International Con-
WARD, J. H. JR. 1963. Hierarchical grouping to ference on the World Wide Web (Santa Clara,
optimize an objective function. J. Am. Stat. CA, Apr.), http://theory.stanford.edu/people/
Assoc. 58, 236 –244. wass/publications/Web Search/Web Search.html.
WATANABE, S. 1985. Pattern Recognition: Hu- ZADEH, L. A. 1965. Fuzzy sets. Inf. Control 8,
man and Mechanical. John Wiley and Sons, 338 –353.
Inc., New York, NY. ZAHN, C. T. 1971. Graph-theoretical methods
WESZKA, J. 1978. A survey of threshold selec- for detecting and describing gestalt clusters.
tion techniques. Pattern Recogn. 7, 259 –265. IEEE Trans. Comput. C-20 (Apr.), 68 – 86.
WHITLEY, D., STARKWEATHER, T., AND FUQUAY,
ZHANG, K. 1995. Algorithms for the constrained
D. 1989. Scheduling problems and travel-
editing distance between ordered labeled
ing salesman: the genetic edge recombina-
trees and related problems. Pattern Recogn.
tion. In Proceedings of the Third Interna-
tional Conference on Genetic Algorithms 28, 463– 474.
(George Mason University, June 4 –7), J. D. ZHANG, J. AND MICHALSKI, R. S. 1995. An inte-
Schaffer, Ed. Morgan Kaufmann Publishers gration of rule induction and exemplar-based
Inc., San Francisco, CA, 133–140. learning for graded concepts. Mach. Learn.
WILSON, D. R. AND MARTINEZ, T. R. 1997. 21, 3 (Dec. 1995), 235–267.
Improved heterogeneous distance func- ZHANG, T., RAMAKRISHNAN, R., AND LIVNY, M.
tions. J. Artif. Intell. Res. 6, 1–34. 1996. BIRCH: An efficient data clustering
WU, Z. AND LEAHY, R. 1993. An optimal graph method for very large databases. SIGMOD
theoretic approach to data clustering: Theory Rec. 25, 2, 103–114.
and its application to image segmentation. ZUPAN, J. 1982. Clustering of Large Data
IEEE Trans. Pattern Anal. Mach. Intell. 15, Sets. Research Studies Press Ltd., Taunton,
1101–1113. UK.

Received: March 1997; revised: October 1998; accepted: January 1999
