Objective function-based clustering
Lawrence O. Hall
Clustering is typically applied for data exploration when there are no or very few labeled data available. The goal is to find groups or clusters of like data. The clusters will be of interest to the person applying the algorithm. An objective function-based clustering algorithm tries to minimize (or maximize) a function such that the clusters that are obtained when the minimum/maximum is reached are homogeneous. One needs to choose a good set of features and the appropriate number of clusters to generate a good partition of the data into maximally homogeneous groups. Objective functions for clustering are introduced. Clustering algorithms generated from the given objective functions are shown, with a number of examples of widely used approaches discussed.
© 2012 Wiley Periodicals, Inc.
How to cite this article:
WIREs Data Mining Knowl Discov 2012, 2: 326–339 doi: 10.1002/widm.1059
INTRODUCTION
Consider the case in which you are given an electronic repository of news articles. You want to try to determine the future direction of a commodity like oil, but do not want to sift through the 50,000 articles by hand. You would like to have them grouped into categories so that you could browse the appropriate ones. You might use the count of words appearing in the document as features. Having words such as commodity or oil or wheat appear multiple times in an article would be a good clue as to what it was concerned with.
Clustering can do the grouping into categories for you. Objective function-based clustering is one way of accomplishing the grouping. In this type of clustering algorithm, there is a function (the objective function) which you try to minimize or maximize. The examples or objects to be partitioned into clusters are described by a set of s features. To begin, we will think of all features as being continuous numeric values such that we can measure distances between them.
A challenge of using objective function-based clustering lies in the fact that it is an optimization problem (Refs 1, 2). As such, it is sensitive to the initialization that is provided. This means that you can get different final partitions from different initializations. The k-means objective function is

J_1(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij} D(x_j, v_i),   (1)
where

x_j \in X represents one of n feature vectors of dimension s;

v_i \in V is an s-dimensional cluster center, representing the average value of the examples assigned to a cluster;

U is the k \times n matrix of example assignments to cluster centers; u_{ij} \in U with u_{ij} \in \{0, 1\} indicates whether the jth example belongs to the ith cluster;

k is the number of clusters;
1. Initialize the k cluster centers V^0, choose \epsilon, and set T = 1.
2. For all x_j do
       u_{ij} = 1 if i = \arg\min_l D(x_j, v_l), and u_{ij} = 0 otherwise.   (2)
3. For i = 1 to k do
       v_i = \sum_{j=1}^{n} u_{ij} x_j / \sum_{j=1}^{n} u_{ij}.
4. If \|V^T - V^{T-1}\| < \epsilon stop; else T = T + 1 and go to 2.

FIGURE 1 | k-Means clustering algorithm.
and D(x_j, v_i) = \|x_j - v_i\|^2 is the distance, for example, the squared Euclidean distance, where \|\cdot\| is also known as the L_2 norm.
In (1), we add up the distances between the examples assigned to a cluster and the corresponding cluster center. The value J_1 is to be minimized. The way to accomplish the minimization is to fix V or U and calculate the other, and then reverse the process. This requires an initialization to which the algorithm is quite sensitive (Refs 11–13). The good news is that the algorithm is guaranteed to converge (Ref 14). The k-means clustering algorithm is shown in Figure 1. We have used bold notation to indicate that x, v are vectors and U, V are matrices. We are going to drop the bold in what follows for convenience, expecting the reader will recall they are vectors or matrices.
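As an illustrative sketch of the alternating optimization in Figure 1 (this is not code from the article; the function name, the naive first-k initialization, and the stopping tolerance are choices made here for brevity), the algorithm fits in a few lines of NumPy:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6):
    """Minimize J_1 by alternating the assignment (step 2) and centroid (step 3) updates."""
    V = X[:k].astype(float)  # naive init: first k examples; the result depends on this choice
    for _ in range(max_iter):
        # Step 2: crisp memberships, u_ij = 1 for the nearest center, 0 otherwise.
        d = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
        labels = d.argmin(axis=1)
        # Step 3: each center becomes the mean of the examples assigned to it.
        V_new = np.array([X[labels == i].mean(axis=0) if (labels == i).any() else V[i]
                          for i in range(k)])
        # Step 4: stop when the centers stop moving.
        if np.linalg.norm(V_new - V) < tol:
            return labels, V_new
        V = V_new
    return labels, V
```

With the deterministic initialization above, rerunning on the same data gives the same partition; substituting a random initialization reproduces the sensitivity discussed in the text.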
Now, we have our first objective-based clustering algorithm defined. We can use it to cluster a four-dimensional, three-class dataset as an illustrative example. Here, the Weka data mining tool (Ref 15) has been used to cluster and display the Iris data (Ref 16). The dataset shown in Figure 2 describes Iris plants and has 150 examples, with 50 examples in each of three classes. There were four numeric features measured: sepal length, sepal width, petal length, and petal width. The projection here is into two dimensions, petal length and petal width. You can see it looks like there might be two classes, as two overlap in Figure 2(a) with one clearly separate. In Figure 2(b), you see a partition of the data into three classes and in Figure 2(c) a different partition (from a different initialization, thus illustrating the sensitivity to initialization).
It is important to note that the expert who created the Iris dataset recognized the three classes of Iris flowers. However, the features recorded do not necessarily disambiguate the classes with 100% accuracy. This is true for many more complex real-world problems. The point is that, without labels, the features may give us a different number of classes than the known number. In this case, we might want better features (or a different algorithm).

FIGURE 2 | The Iris data (a) with labels, (b) clustered by k-means with a good initialization, and (c) clustered by k-means with a bad initialization.
Note that for the Iris data some claim the features really only allow for two clusters (Refs 13, 17). Real-world ground truth tells us that there are three clusters or classes. Which is correct? Perhaps both, with the given features?
FUZZY k-MEANS
If you allow an example x_j to partially belong to more than one cluster, with a membership in cluster i of \mu_i(x_j) \in [0, 1], this is called a fuzzy membership (Ref 18). Using fuzzy memberships allows the creation of the fuzzy k-means (FKM) algorithm, in which each example has some membership in each cluster. The algorithm was originally called fuzzy c-means. As with k-means, an educated guess of the number of clusters using domain knowledge is required of the user.

The objective function for FKM is J_m in Eq. (3):
J_m(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} w_j u_{ij}^m D(x_j, v_i),   (3)
where u_{ij} is the membership value of the jth example, x_j, in the ith cluster; v_i is the ith cluster centroid; n is the number of examples; k is the number of clusters; and m controls the fuzziness of the memberships, with values very close to one causing the partition to be nearly crisp (approximating k-means) and higher values causing fuzzier partitions, spreading the memberships across more clusters.

D(x_j, v_i) = \|x_j - v_i\|^2 is the norm, for example, the Euclidean distance.

w_j is the weight of the jth example. For FKM, w_j = 1, \forall j. We will use this value, which is not typically shown, later.
1. l = 0.
2. Initialize the cluster centers (the v_i's) to get V^0.
3. Repeat
       l = l + 1,
       calculate U^l according to Eq. (4),
       calculate V^l according to Eq. (5).
4. Until \|V^l - V^{l-1}\| < \epsilon.

FIGURE 3 | FKM algorithm.
U and V can be calculated as

u_{ij} = \frac{D(x_j, v_i)^{1/(1-m)}}{\sum_{l=1}^{k} D(x_j, v_l)^{1/(1-m)}},   (4)

v_i = \frac{\sum_{j=1}^{n} w_j (u_{ij})^m x_j}{\sum_{j=1}^{n} w_j (u_{ij})^m}.   (5)
The clustering algorithm is shown in Figure 3. There is some extra computation when compared to k-means, and we must choose the value for m. There are many papers that describe approaches to choosing m (Refs 19, 20), some of which are automatic. A default choice of m = 2 often works reasonably well as a starting point. There is a convergence theorem for FKM (Ref 21) that shows that it ends up in local minima or saddle points, but will stop iterating.
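The alternation of Eqs. (4) and (5) can be sketched as follows (an illustrative NumPy version, not the article's code; the first-k initialization, the fixed iteration count, and the small distance floor that guards the division are choices made here):

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, iters=100):
    """Alternate the membership update of Eq. (4) and the center update of Eq. (5), w_j = 1."""
    V = X[:k].astype(float)                       # naive init: first k examples
    U = None
    for _ in range(iters):
        # Squared Euclidean distances D(x_j, v_i); small floor avoids division by zero.
        d = np.maximum(((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2), 1e-12)  # (k, n)
        # Eq. (4): u_ij proportional to D^(1/(1-m)), normalized over the k clusters.
        u = d ** (1.0 / (1.0 - m))
        U = u / u.sum(axis=0, keepdims=True)      # each column sums to 1
        # Eq. (5): fuzzy weighted means of the examples.
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
    return U, V
```

Setting m close to 1 drives the memberships toward 0/1 and the result toward k-means, as described above; larger m spreads membership across clusters.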
This algorithm is very useful if you know you have examples that truly are a mixture of classes. It also allows for the easy identification of examples that do not fit well into a single cluster and provides information on how well any example fits a cluster. An interesting case of its use is with the previously mentioned Iris dataset. FKM in many experiments (over 5000 random initializations) always converged to the same final partition, which had 16 errors when searching for three clusters, as shown in Figure 2. k-Means converged to one of three partitions, with the most likely one the same as FKM's (Ref 13). The other two were local extrema that resulted in significantly higher values of J_1, one of which is shown in Figure 2. Of course, both algorithms are sensitive to initialization, and for other datasets FKM will not always converge to the same solution.
The FKM algorithm with the Euclidean distance function has a bias toward hyperspherical clusters of equal size. If you change the distance function (Refs 22, 23), hyperellipsoidal clusters can be found.
EXPECTATION MAXIMIZATION CLUSTERING

Consider the case in which you want to assign probabilities to whether an example is in one cluster or another. We might have the case that x_5 belongs to cluster A with a probability of 0.9, cluster B with a probability of 0, and cluster C with a probability of 0.1 for a three-cluster or class problem. Note that, without labels, the cluster designations are arbitrary. The clustering algorithm to use is based on the idea of expectation maximization (Ref 24), or finding the maximum likelihood solution for estimating the model parameters. We are going to give a simplified, clustering-focused version of the algorithm here; more general details can be found in Refs 15 and 24. The algorithm does come with a convergence proof (Ref 25), which guarantees that we can find a solution (although not necessarily the optimal solution) under conditions that usually are not violated by real data.
We want to find the set of parameters \Theta that maximizes the log likelihood of generating our data X. Now \Theta will consist of our probability function for examples belonging to classes, which, in the simplest case, requires us to find the centroid of the clusters and their standard deviations. More generally, the necessary parameters to maximize Eq. (6) are found:

\Theta = \arg\max_{\Theta} \sum_{i=1}^{n} \log(P(x_i \mid \Theta)).   (6)
Let p_j be the probability of class j. Let z_{ij} = 1 if example i belongs to class j, and 0 otherwise. Now our objective function will be

L(X, \Theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \log(p_j P(x_i \mid j)).   (7)
Now, how do we calculate P(x_i \mid j)? A simple formulation is given in Eq. (8) using a Gaussian-based distance:

P(x_i \mid j) = f(x_i; \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}.   (8)
This works for roughly balanced classes with a spherical shape in feature space. A more general description, which works better for data that does not fit the constraints of the previous sentence, is given in Ref 26. Our objective function depends on \mu_j and \sigma_j. We observe that

p_j = \sum_{i=1}^{n} P(x_i \mid j)/n.   (9)
1. Initialize the \mu_j's and \sigma_j's by running one iteration of k-means and taking the k cluster centers and their standard deviations. Initialize \epsilon and L_0 = -\infty, where L is calculated from Eq. (7).
2. Repeat
3.     E-step: Calculate P(x_i \mid j) as in Eq. (8).
4.     M-step: n_j = \sum_{i=1}^{n} P(x_i \mid j), 1 \le j \le k,
           p_j = n_j / n,
           \mu_j = \sum_{i=1}^{n} P(x_i \mid j) x_i / n_j, 1 \le j \le k,
           \sigma_j^2 = \sum_{i=1}^{n} P(x_i \mid j)(x_i - \mu_j)^2 / \sum_{i=1}^{n} P(x_i \mid j).
5. Until |L_t - L_{t-1}| < \epsilon.

FIGURE 4 | The EM clustering algorithm.
FIGURE 5 | The Iris data clustered by the EM algorithm.
The EM algorithm is shown in Figure 4. We have applied the EM algorithm as implemented in Weka to the Iris data. A projection of the partition obtained when searching for three classes is shown in Figure 5. The final partition differs, albeit slightly, from that found in Figure 2. However, it is interesting that even on this simple dataset there are disagreements which, unsurprisingly, involve the two overlapping classes.
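The E- and M-steps of Figure 4 can be sketched for a single feature as below (an illustrative toy version, not the article's or Weka's code; the even-spread initialization and the variance floor are choices made here to keep the sketch self-contained, whereas Figure 4 initializes from one k-means iteration):

```python
import math

def em_1d(xs, k=2, iters=50):
    """One-dimensional EM: E-step responsibilities from Eq. (8), M-step as in Figure 4."""
    # Crude initialization: means spread over the data range, a common wide std.
    lo, hi = min(xs), max(xs)
    mu = [lo + (hi - lo) * (j + 1) / (k + 1) for j in range(k)]
    sigma = [((hi - lo) / (2 * k)) or 1.0] * k
    p = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of cluster j for example i, prior times Eq. (8).
        R = []
        for x in xs:
            w = [p[j] * math.exp(-(x - mu[j]) ** 2 / (2 * sigma[j] ** 2))
                 / (math.sqrt(2 * math.pi) * sigma[j]) for j in range(k)]
            s = sum(w)
            R.append([wj / s for wj in w])
        # M-step: n_j, p_j, mu_j, sigma_j exactly as in Figure 4.
        for j in range(k):
            nj = sum(r[j] for r in R)
            p[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(R, xs)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(R, xs)) / nj
            sigma[j] = math.sqrt(max(var, 1e-6))  # floor keeps the Gaussian well defined
    return p, mu, sigma
```

Unlike the crisp assignments of k-means, every example contributes fractionally to every cluster's mean and variance, which is what produces the soft partition shown in Figure 5.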
POSSIBILISTIC k-MEANS
The clustering algorithm discussed in this section was designed to be able to cluster data like that shown in Figure 6. Visually, there is a bar with a spherical object at each end. A person would most likely say there are three clusters: the two spheres and the linear bar. The problem for the clustering algorithms discussed thus far is that they must find very different shapes (spherical and linear) and are not designed to do so. The possibilistic k-means (PKM) algorithm can find nonspherical clusters together with ellipsoidal or spherical clusters (Ref 27). The algorithm, originally named possibilistic c-means, is also significantly more noise tolerant than FKM (Refs 28, 29).

FIGURE 6 | Three cluster problem that is difficult.
The approach is more computationally complex and requires some attention to parameter setting (Ref 28). The innovation is to view the examples as being possible members of clusters. Possibility theory is utilized to create the objective function (Ref 30). So, an example might have a possibility of 1 (potentially complete belonging) in more than one cluster. The membership value can also be viewed as the compatibility of the assignment of an example to a cluster.

The objective function for PKM looks like that for FKM with an extra term and some different constraints on the membership values, as shown in Eq. (10). The second term forces the u_{ij} to be as large as possible, to avoid the trivial solution:
J_m(U, V) = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m D(x_j, v_i) + \sum_{i=1}^{k} \eta_i \sum_{j=1}^{n} (1 - u_{ij})^m,   (10)
where the \eta_i are suitable positive numbers; u_{ij} is the membership value of the jth example, x_j, in the ith cluster, such that u_{ij} \in [0, 1], 0 < \sum_{j=1}^{n} u_{ij} \le n, and \max_j u_{ij} > 0, \forall i. Critically, the memberships are
1. Run FKM for one full iteration. Choose m and k.
2. l = 0.
3. Initialize the cluster centers (the v_i's) from FKM to get V^0.
4. Estimate \eta_i using Eq. (12).
5. Repeat
       l = l + 1,
       calculate U^l according to Eq. (11),
       calculate V^l according to Eq. (5).
6. Until \|V^l - V^{l-1}\| < \epsilon.

Now, if you want to know the shapes of the possibility distributions, first re-estimate \eta_i with Eq. (13). This is optional.

1. Repeat
       l = l + 1,
       calculate U^l according to Eq. (11),
       calculate V^l according to Eq. (5).
2. Until \|V^l - V^{l-1}\| < \epsilon.

FIGURE 7 | PKM algorithm.
now not constrained to add to 1 across classes, with the only requirement being that every example have nonzero membership (or possibility) in at least one cluster. v_i is the ith cluster centroid; n is the number of examples; k is the number of clusters; m affects the possibilistic membership values, with values closer to 1 making the results closer to a hard partition (which has memberships of only 0 or 1); and D(x_j, v_i) = \|x_j - v_i\|^2 is the norm, such as the Euclidean distance.
Now, the calculation for the cluster centers is still done by Eq. (5), with w_j = 1, \forall j. The calculation for the possibilistic memberships is given by Eq. (11):

u_{ij} = \frac{1}{1 + \left( D(x_j, v_i)/\eta_i \right)^{1/(m-1)}}.   (11)
The value of \eta_i has the effect of determining the distance at which an example's membership becomes 0.5. It should be chosen based on the bandwidth of the desired membership distribution for a cluster. In practice, a value proportional to the average fuzzy intra-cluster distance can work, as in Eq. (12). The authors of the approach note that R = 1 is a typical choice:
\eta_i = R \frac{\sum_{j=1}^{n} (u_{ij})^m D(x_j, v_i)}{\sum_{j=1}^{n} (u_{ij})^m}.   (12)
The PKM algorithm is quite sensitive to the choice of m (Refs 28, 29). You can fix the \eta_i's or calculate them each iteration. When fixed, you have guaranteed convergence. The algorithm that will allow you to find clusters such as in Figure 6 benefits from the use of Eq. (13) to generate \eta after convergence is achieved using Eq. (12). Consider \Omega_i to contain all the memberships for the ith cluster. Then (\Omega_i)_\alpha contains all membership values above \alpha and, in terms of possibility theory, is called an \alpha-cut. So, for example, with an \alpha of 0.5 you get a good set of members that have a pretty strong affinity for the cluster:
\eta_i = R \frac{\sum_{x_j \in (\Omega_i)_\alpha} D(x_j, v_i)}{|(\Omega_i)_\alpha|}.   (13)
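The possibilistic update of Eq. (11) and the \eta estimate of Eq. (12) are simple enough to sketch directly (illustrative NumPy, not the article's code; the function names and the (k, n)-shaped distance matrix are conventions chosen here):

```python
import numpy as np

def pkm_memberships(d, eta, m=2.0):
    """Eq. (11): u_ij = 1 / (1 + (D(x_j, v_i)/eta_i)^(1/(m-1))).
    d is the (k, n) matrix of distances, eta the (k,) vector of bandwidths."""
    return 1.0 / (1.0 + (d / eta[:, None]) ** (1.0 / (m - 1.0)))

def estimate_eta(U, d, m=2.0, R=1.0):
    """Eq. (12): eta_i = R * sum_j u_ij^m D_ij / sum_j u_ij^m (fuzzy intra-cluster spread)."""
    Um = U ** m
    return R * (Um * d).sum(axis=1) / Um.sum(axis=1)
```

Note that, unlike the FKM update of Eq. (4), each u_{ij} here depends only on the distance to its own cluster, so memberships do not compete across clusters; that is what makes far-away noise points receive low possibility in every cluster.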
The algorithm for PKM is shown in Figure 7. A nice advantage of this algorithm is its performance when your dataset is noisy, as well as its ability to nicely extract different shapes, although this typically requires a distance function that is a bit more complex than the Euclidean distance. Two other distance functions that can be used are shown in Eqs (14) and (16). The scaled Mahalanobis distance (Ref 31) allows for nonspherical shapes. If your data potentially contain spherical shells, then the distance measure shown in Eq. (16) (Ref 32) can be effective. However, the introduction of the radius into the distance measure requires a new set of updated equations for finding the cluster centers, which can be found in Ref 27:
D_{ij} = |F_i|^{1/n} (x_j - v_i)^T F_i^{-1} (x_j - v_i),   (14)
1. Given R = [r_{ij}], initialize 2 \le k < n, initialize U^0 \in M_{fk} with u_{ij} \in \{0, 1\} and the constraints of Eq. (17) holding, and set T = 1.
2. Calculate the k mean vectors v_i to create V^T as

       v_i = (u_{i1}, u_{i2}, \ldots, u_{in})^T / \sum_{j=1}^{n} u_{ij}.   (19)

3. Update U^T using Eq. (2), where the distance is

       D(x_j, v_i) = (R v_i)_j - (v_i^T R v_i)/2.   (20)

4. If \|U^T - U^{T-1}\| < \epsilon stop; else T = T + 1 and go to 2.

FIGURE 8 | Relational k-means clustering algorithm.
where F_i is the fuzzy covariance matrix of cluster i and can be updated with Eq. (15):

F_i = \frac{\sum_{j=1}^{n} u_{ij}^m (x_j - v_i)(x_j - v_i)^T}{\sum_{j=1}^{n} u_{ij}^m},   (15)
D_{ij} = \left( \left( (x_j - v_i)^T (x_j - v_i) \right)^{1/2} - r_i \right)^2,   (16)
where r_i is the radius of cluster i. With new calculations to find the cluster centers, this results in an algorithm called possibilistic k-shells. It is quite effective if you have a cluster within a cluster (like a big O containing a small o).
There are a number of alternative formulations of possibilistic clustering, such as that in Ref 33, where the authors argue their approach is less sensitive to parameter setting. In Ref 34, a mutual cluster repulsion term is introduced to solve the technical problem of coincident cluster centers providing the best minimization, and it introduces some other potentially useful properties.

The Mahalanobis distance measure can be used in k-means (with just the covariance matrix), FKM, and EM as well. In FKM, the use of Eq. (14) as the distance measure gives the so-called GK clustering algorithm (Ref 23), which is known for its ability to capture hyperellipsoidal clusters.
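The fuzzy covariance of Eq. (15) and the scaled Mahalanobis distance of Eq. (14) can be sketched for a single cluster as below (illustrative NumPy, not the article's code; the exponent on the determinant follows Eq. (14) with n taken as the feature dimension of F_i, which is an interpretation made here):

```python
import numpy as np

def fuzzy_covariance(X, v, u, m=2.0):
    """Eq. (15): membership-weighted covariance of one cluster with center v."""
    um = u ** m
    diff = X - v                                              # (n_examples, s)
    F = (um[:, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0)
    return F / um.sum()

def gk_distance(X, v, F):
    """Eq. (14): |F|^(1/n) (x - v)^T F^{-1} (x - v) for every row of X."""
    n = F.shape[0]                                            # feature dimension
    Finv = np.linalg.inv(F)
    diff = X - v
    scale = np.linalg.det(F) ** (1.0 / n)
    return scale * np.einsum('js,st,jt->j', diff, Finv, diff)
```

With F_i equal to the identity matrix, the distance reduces to the squared Euclidean distance, recovering the spherical bias discussed earlier; an elongated F_i stretches the unit ball into an ellipsoid.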
RELATIONAL CLUSTERING

How do we cluster data for which the attributes are not all numeric? Relational clustering is one approach that can be used stand-alone or in an ensemble of clustering algorithms (Refs 35, 36). Cluster ensembles can provide other options for dealing with mixed attribute types. What if all we know is how examples relate to one another in terms of how similar they are? Relational clustering works when, for x_i \in X, \rho(x_i, x_j) = r_{ij} \in [0, 1]. We can think of \rho as a binary fuzzy relation. R = [r_{ij}] is a fuzzy relation (or a typical relation matrix if r_{ij} \in \{0, 1\}). Relational clustering algorithms are typically associated with graphs, because R can always be viewed as the adjacency matrix of a weighted digraph on the n examples (nodes) in X.
Graph clustering is not typically addressed with an objective function-based clustering approach (Ref 37). However, there are relational versions of k-means and FKM (Ref 38) which are applicable to graphs and will be discussed. First, our U matrix of example memberships in clusters can be put into a context that allows both k-means and FKM to be described with one objective function:

M_{fk} = \{ U \in R^{kn} \mid 0 \le u_{ij} \le 1; \ \sum_{i=1}^{k} u_{ij} = 1, \text{ for } 1 \le j \le n; \ \sum_{j=1}^{n} u_{ij} > 0, \text{ for } 1 \le i \le k \}.   (17)
In Eq. (17), the memberships can be fuzzy or in {0, 1}, called crisp. The same constraints as for FKM and k-means hold. Our objective function is
JR_m(U) = \sum_{i=1}^{k} \left( \frac{\sum_{j=1}^{n} \sum_{l=1}^{n} u_{ij}^m u_{il}^m r_{jl}}{2 \sum_{t=1}^{n} u_{it}^m} \right),   (18)
where m \ge 1. If we have numeric data, we can create r_{il} = \delta_{il}^2 = \|x_i - x_l\|^2 for some distance function. The square just ensures a positive number. The algorithm for relational k-means is shown in Figure 8.

This algorithm has a convergence proof based on the non-relational case. It allows us to do graph
1. Given R = [r_{ij}], initialize 2 \le k < n, initialize U^0 \in M_{fk} with u_{ij} \in [0, 1] and the constraints of Eq. (17) holding, and set T = 1. Choose m > 1.
2. Calculate the k mean vectors v_i to create V^T as

       v_i = (u_{i1}^m, u_{i2}^m, \ldots, u_{in}^m)^T / \sum_{j=1}^{n} u_{ij}^m.   (21)

3. Update U^T using Eq. (4), where the distance is

       D(x_j, v_i) = (R v_i)_j - (v_i^T R v_i)/2.   (22)

4. If \|U^T - U^{T-1}\| < \epsilon stop; else T = T + 1 and go to 2.

FIGURE 9 | Relational fuzzy k-means clustering algorithm.
clustering using an objective function-based clustering algorithm. The fuzzy version (with the memberships relaxed to be in [0, 1]) is shown in Figure 9. It also converges (Ref 38) and provides a second option for relational clustering with objective function-based algorithms.
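The relational center and distance computations of Figures 8 and 9 can be sketched as follows (illustrative NumPy, not the article's code; treating the prototype v_i as a normalized membership vector over the n examples follows Eqs. (19)/(21), and the function names are chosen here):

```python
import numpy as np

def relational_centers(R, U, m=1.0):
    """Eqs. (19)/(21): each prototype v_i is the (possibly fuzzified) normalized
    membership vector over the n examples; m = 1 gives the crisp version."""
    Um = U ** m                                  # U is (k, n)
    return Um / Um.sum(axis=1, keepdims=True)

def relational_distance(R, v):
    """Eqs. (20)/(22): D(x_j, v_i) = (R v_i)_j - v_i^T R v_i / 2, for one prototype v_i."""
    Rv = R @ v
    return Rv - (v @ Rv) / 2.0
```

A useful sanity check: when r_{jl} = ||x_j - x_l||^2 comes from numeric data, this relational distance reproduces the squared Euclidean distance to the ordinary cluster mean, which is why the convergence argument carries over from the non-relational case.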
Another approach to fuzzy relational clustering is fuzzy medoid clustering (Ref 39). A set of k fuzzy representatives from the data (medoids) are found to minimize the dissimilarity in the created clusters. The approach typically requires less computational time than that described here.
ADJUSTING THE PERFORMANCE OF OBJECTIVE FUNCTION-BASED ALGORITHMS

The algorithms discussed thus far are very good clustering algorithms. However, for the most part, they have important limitations. With the exception of PKM, noise will have a strong negative effect on them. Unless otherwise noted, there is a built-in bias toward hyperspherical clusters of equal size. That is a problem if you have, for example, a cluster of interest which is rare and, hence, small.

To get different cluster shapes, the distance measure can be changed. We have seen an example of the Mahalanobis distance and discussed the easy-to-compute Euclidean distance. There are many other choices, such as given in Ref 40. Any that involve the use of the covariance matrix of a cluster, or some variation of it, typically will not require changes in the clustering algorithm.
For example, we can change our probability calculation in EM to be as follows:

P(x_i \mid j) = f(x_i; \mu_j, \Sigma_j) = \frac{\exp\{ -\frac{1}{2} (x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j) \}}{(2\pi)^{s/2} |\Sigma_j|^{1/2}},   (23)
where \Sigma_j is the covariance matrix for the jth cluster, s is the number of features, and \mu_j is the centroid (average) of cluster j. This gives us ellipsoidal clusters. Now, if we want different shapes for the clusters, we can look at parameterizations of the covariance matrix. An eigenvalue decomposition is

\Sigma_j = \lambda_j D_j A_j D_j^T,   (24)
where D_j is the orthogonal matrix of eigenvectors, A_j is a diagonal matrix whose elements are proportional to the eigenvalues of \Sigma_j, and \lambda_j is a scalar value (Ref 41). This formulation can be used in k-means, FKM, and indeed PKM (with a fuzzy covariance matrix for the latter two). It is a very flexible formulation.
The orientation of the principal components of \Sigma_j is determined by D_j; the shape of the density contours is determined by A_j. Then \lambda_j specifies the volume of the corresponding ellipsoid, which is proportional to \lambda_j^s |A_j|. The orientation, volume, and shape of the distributions can be estimated from the data and be allowed to vary for each cluster (although you can constrain them to be the same for all).
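The decomposition of Eq. (24) is easy to compute explicitly (an illustrative NumPy sketch, not the article's code; the choice of normalizing A_j to unit determinant, so that \lambda_j alone carries the volume, is one common convention and is assumed here):

```python
import numpy as np

def decompose_covariance(S):
    """Split a covariance matrix as in Eq. (24): S = lambda * D A D^T,
    with |A| = 1 so that lambda^s |A| tracks the ellipsoid volume."""
    eigvals, D = np.linalg.eigh(S)            # D: orthogonal matrix of eigenvectors
    s = len(eigvals)
    lam = np.prod(eigvals) ** (1.0 / s)       # scalar volume factor
    A = np.diag(eigvals / lam)                # unit-determinant shape matrix
    return lam, D, A
```

Constraining clusters to share lam, A, or D then corresponds directly to forcing equal volumes, shapes, or orientations across clusters, as discussed above.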
In Ref 22, a fuzzified version of Eq. (23) is given and modifications are made to FKM to result in the so-called Gath-Geva clustering algorithm. The
1. l = 0.
2. Initialize U^0, perhaps with one iteration of FKM or k-means. Initialize m.
3. Repeat
       l = l + 1.
       For FKM: calculate U^l according to Eq. (4), using Eq. (29) for the distance. Note that the distances depend on the previous membership values. You might add a step to calculate them all, to improve computation time, if you like.
       For k-means: calculate U^l according to Eq. (2), using Eq. (29) for the distance with m = 1.
4. Until \|U^l - U^{l-1}\| < \epsilon.

FIGURE 10 | Kernel-based k-means/FKM algorithm.
algorithm also has a built-in method to discover the number of clusters in the data, which will be addressed in what follows.
Kernel-Based Clustering

Another interesting way to change distance functions is to use the kernel trick (Ref 42) associated with support vector machines, which are trained with labeled data. A very simplified explanation of the idea, which is well explained by Burges (Ref 43), is the following. Consider projecting the data into a different space, where it may be more simply separable. From a clustering standpoint, we might think of a three-cluster problem where the clusters are touching in some dimensions. When projected, they may nicely group for any clustering algorithm to find them (Ref 44).
Now consider \Phi: R^s \to H to be a nonlinear mapping function from the original input space to a high-dimensional feature space H. By applying the nonlinear mapping function \Phi, the dot product x_i \cdot x_j in our original space is mapped to \Phi(x_i) \cdot \Phi(x_j) in feature space. The key notion in kernel-based learning is that the mapping function need not be explicitly specified. The kernel function K(x_i, x_j) in the original space R^s can be used to calculate the dot product \Phi(x_i) \cdot \Phi(x_j).

First, we introduce three (of many) potential kernels which satisfy Mercer's condition (Ref 43):
K(x_i, x_j) = (x_i \cdot x_j + b)^d,   (25)

where d is the degree of the polynomial and b is some constant offset;

K(x_i, x_j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}},   (26)

where \sigma^2 is a variance parameter;

K(x_i, x_j) = \tanh(\alpha (x_i \cdot x_j) + \beta),   (27)

where \alpha and \beta are constants that shape the sigmoid function.
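The three kernels of Eqs. (25)-(27) translate directly into code (illustrative NumPy, not the article's code; the default parameter values are placeholders chosen here):

```python
import numpy as np

def poly_kernel(xi, xj, b=1.0, d=2):
    """Eq. (25): polynomial kernel (xi . xj + b)^d."""
    return (np.dot(xi, xj) + b) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """Eq. (26): Gaussian (RBF) kernel exp(-||xi - xj||^2 / (2 sigma^2))."""
    diff = np.asarray(xi) - np.asarray(xj)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, alpha=1.0, beta=0.0):
    """Eq. (27): sigmoid kernel tanh(alpha (xi . xj) + beta)."""
    return np.tanh(alpha * np.dot(xi, xj) + beta)
```

Note that the RBF kernel always returns 1 for identical inputs, which corresponds to K(x_j, x_j) in the distance expansion of Eq. (29) below being constant.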
Practically, the kernel function K(x_i, x_j) can be integrated into the distance function of a clustering algorithm, which changes the update equations (Refs 45, 46). The most general approach is to construct the cluster center prototypes in kernel space (Ref 46), because it allows more kernel functions to be used. Here, we will take a look at hard and fuzzy k-means approaches to objective function-based clustering with kernels. Now our distance becomes D(x_j, v_i) = \|\Phi(x_j) - v_i\|^2, so our objective function reads as

J_{K,m} = \sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m \|\Phi(x_j) - v_i\|^2.   (28)
For k-means, we just need m = 1 with u_{ij} \in \{0, 1\} as usual. We see that our objective function has \Phi(x_j) in it, so our update equations and distance equation must change. Before discussing the modified algorithm, we introduce the distance function.

Our distance will be as shown in Eq. (29) for the fuzzy case. It is simple to modify for k-means. It is important to note that the cluster centers themselves do not show up in the distance equation:
D(x_j, v_i) = K(x_j, x_j) - 2 \frac{\sum_{l=1}^{n} u_{il}^m K(x_l, x_j)}{\sum_{l=1}^{n} u_{il}^m} + \frac{\sum_{q=1}^{n} \sum_{l=1}^{n} u_{iq}^m u_{il}^m K(x_q, x_l)}{\left( \sum_{q=1}^{n} u_{iq}^m \right)^2}.   (29)
The algorithm for k-means and FKM is then shown in Figure 10.
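Eq. (29) computes all k x n feature-space distances from nothing but the Gram matrix K and the memberships, which can be sketched as follows (illustrative NumPy, not the article's code; the (k, n) membership layout and the einsum formulation are choices made here):

```python
import numpy as np

def kernel_distance(K, U, m=2.0):
    """Eq. (29): ||Phi(x_j) - v_i||^2 from the Gram matrix K (n x n) and
    memberships U (k x n); the centers never appear explicitly."""
    Um = U ** m
    s1 = np.diag(K)[None, :]                               # K(x_j, x_j)
    s2 = 2.0 * (Um @ K) / Um.sum(axis=1, keepdims=True)    # cross term
    s3 = np.einsum('iq,il,ql->i', Um, Um, K) / Um.sum(axis=1) ** 2
    return s1 - s2 + s3[:, None]
```

As a sanity check, with the linear kernel K(x_q, x_l) = x_q . x_l this reduces exactly to the squared Euclidean distance to the (fuzzy) mean, so kernel k-means with a linear kernel recovers ordinary k-means.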
Choosing the Right Number of Clusters: Cluster Validity

There are many functions (Refs 47–49) that can be applied to evaluate the partitions produced by a clustering algorithm using different numbers of cluster centers. The silhouette criterion (Ref 50) and fuzzy silhouette criterion (Ref 51) measure the similarity of objects to others in their
1. Set the initial number of clusters I, typically 2. Set the maximum number of clusters MC, MC << n unless something is unusual. Initialize T = I, k = 0. Choose the validity metric to be used and parameters for it.
2. While (T <= MC and k == 0) do
3.     Cluster the data into T clusters.
4.     k = check_validity() /* Returns the number of clusters if applicable, or 0. */
5.     T = T + 1.
6. If (k == 0) return MC, else return k.

FIGURE 11 | Finding the right number of clusters with a partition validity metric. Any validity metric that applies to a particular objective function-based clustering algorithm can be applied.
own cluster against the nearest object from another. A nice comparison for k-means and some hierarchical approaches is found in Ref 52. Perhaps the simplest (but far from only) approach is that shown in Figure 11. Here you start with 2 clusters and run the clustering algorithm to completion (or enough iterations to have a reasonable partition), then increase to 3, 4, ..., MC clusters and repeat the process. A validity metric is applied to each partition and can be used to pick out the right one according to it. This will typically be determined at the point when the declining or increasing value of the metric changes direction (increases/decreases). Note that you cannot use the objective function itself, because it will prefer many clusters (sometimes as many as there are examples).
A very good, simple cluster validity metric for fuzzy partitions of data is the Xie-Beni index [Eq. (30)] (Refs 53, 54). It uses the value m with u_{ij}, and you can set m to the same value as used in clustering or simply a default (say m = 2). The search is for the smallest S:

S = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n} u_{ij}^m D(x_j, v_i)}{n \min_{i \ne j} \{ D(v_i, v_j) \}}.   (30)
The generalized Dunn index (Ref 47) has proved to be a good validity metric for k-means partitions. One good version of it is shown in Eq. (31), with a nice discussion given in Ref 47. It looks at between-cluster separation versus within-cluster scatter, measured by the biggest intra-cluster distance:

S_{gd} = \min_{1 \le s \le k} \left\{ \min_{1 \le t \le k, t \ne s} \left\{ \frac{\min_{z \in v_s, w \in v_t} D(z, w)}{\max_{1 \le l \le k} (\max_{x, y \in v_l} D(x, y))} \right\} \right\}.   (31)
A relatively new approach to determining the number of clusters involves the user of the clustering algorithm. These approaches are called visual assessment techniques (Refs 55, 56). The examples or objects are ordered to reflect dissimilarities (rows and columns in the dissimilarity matrix are moved). The newly ordered matrix of pairwise example dissimilarities is displayed as an intensity image. Clusters are indicated by dark blocks of pixels along the diagonal. The viewer can decide how many clusters one sees.
Scaling Clustering Algorithms to Large Datasets

The clustering algorithms we have discussed take a significant amount of time when the number of examples or features, or both, is large. For example, the k-means run time requires checking the distance of n examples of dimension s against k cluster centers, with an update of the (n \times k) U matrix during each iteration. The run time is proportional to (nsk + nk)t for t iterations. Using big-O notation (Ref 57), the average run time is O(nskt). This is linear in n, it is true, but the distances are computationally costly to compute. As n gets very large, we would like to reduce the time required to accomplish clustering.

Clustering can be sped up with distributed approaches (Refs 58–60), where the algorithm is generalized to work on multiple processors. An early approach to speeding up k-means clustering is given in Ref 61. They provide a four-step process, shown in Figure 12, that allows just one pass through the data, assuming a maximum-sized piece of memory can be used to store data. An advantage of this approach is that you are only loading the data from disk one time, meaning the data can be much larger than the available memory. The clustering is done in step 2.
1. Obtain the next available (possibly random) sample from the dataset and put it in free memory space. Initially, you may fill it all.
2. Update the current model using the data in memory.
3. Based on the updated model, classify the loaded examples as ones that
   (a) need to be retained in memory,
   (b) can be discarded with updates to the sufficient statistics,
   (c) can be reduced via compression and summarized as sufficient statistics.
4. Determine if the stopping criteria are satisfied. If so, terminate; else go to 1.

FIGURE 12 | An approach to speeding up k-means (applied in step 2) with one pass through the data.
Now, to speed up FKM, the single-pass algorithm can be used.62,63 It makes use of the weights shown in Eq. (3). The approach is pretty simple. Break the data into c chunks.
(1) Cluster the first chunk.
(2) Create weights for the cluster centers based on the amount of membership assigned to them with Eq. (32). Here, n_d is the number of examples being clustered for a chunk of data.
(3) Bring in the next chunk of data and the k weighted cluster centers from the previous step and apply FKM.
(4) Go to step 2 until all chunks are processed.
So, if c = 10, in the first step you process 10% of the n examples and n_d = 0.1n. In the next 9 steps, there are n_d = 0.1n + k examples to cluster:

w_j = Σ_{l=1}^{n_d} u_{jl} w_l,   1 ≤ j ≤ k.   (32)
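The single-pass FKM steps above can be sketched as follows. The membership and center updates are the standard weighted fuzzy c-means rules, and the new center weights follow Eq. (32); function names and parameter defaults are illustrative assumptions rather than the exact formulation of Refs 62 and 63.

```python
import numpy as np

def weighted_fkm(X, w, centers, m=2.0, iters=25, eps=1e-9):
    """Weighted FKM on one chunk: example l carries weight w_l.
    Returns the centers and the new weights w_j = sum_l u_jl * w_l (Eq. 32)."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + eps  # (n, k)
        u = 1.0 / d ** (2.0 / (m - 1.0))          # unnormalized memberships
        u /= u.sum(axis=1, keepdims=True)         # each row sums to 1
        um = (u ** m) * w[:, None]                # weighted, fuzzified memberships
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
    return centers, u.T @ w                       # Eq. (32) center weights

def single_pass_fkm(chunks, init_centers):
    """Steps (1)-(4): cluster the first chunk, then carry the k weighted
    centers into each subsequent chunk (fresh examples get weight 1)."""
    centers = init_centers
    carry_X = np.empty((0, init_centers.shape[1]))  # weighted centers carried over
    carry_w = np.empty(0)
    for chunk in chunks:
        X = np.vstack([carry_X, chunk])
        w = np.concatenate([carry_w, np.ones(len(chunk))])
        centers, cw = weighted_fkm(X, w, centers)
        carry_X, carry_w = centers.copy(), cw
    return centers
```

Each chunk is clustered together with only k weighted points summarizing everything seen so far, which is what keeps the memory footprint at one chunk plus k centers.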
SUMMARY AND DISCUSSION
Objective function-based clustering describes an approach to grouping unlabeled data in which there is a global function to be minimized or maximized to find the best data partition. The approach is sensitive to initialization, and a brief example of this was given. It is necessary to specify the number of clusters for these approaches, although algorithms can easily be built22 to incorporate methods to determine the number of clusters.
Four major approaches to objective function-based clustering have been covered: k-means, FKM, PKM, and expectation maximization. Relational clustering can be done with any of the major clustering algorithms by using a matrix of distances between examples based on some relation. We have briefly discussed the importance of the distance measure in determining the shapes of the clusters that can be found.
A kernel-based approach to objective function-based clustering has been introduced. It has the promise, with the choice of the right kernel function, of allowing almost any cluster shape to be found.
A section on how to determine the right number of clusters (cluster validity) was included. It shows just two of many validity measures; however, they both have performed well.
With the exception of PKM, the other approaches discussed here are sensitive to noise. Most clustering problems are not highly noisy. If yours is and you want to use objective function-based techniques, take a look at Refs 4-6, which contain modified algorithms that are essential for this problem.
For a clustering problem, one needs to think about the data and choose a type of algorithm. The expected shape of clusters, number of features, amount of noise, whether examples of mixed classes exist, and amount of data are among the critical considerations. You can choose some expected values for k, the number of clusters, or use a validity function to tell you the best number. You might want to try multiple initializations and take the lowest (highest) value from the objective function, which indicates the best partition. There are a number of publicly available clustering algorithms that can be tried. In particular, the freely available Weka15 data mining tool has several, including k-means and EM. A couple of fuzzy clustering algorithms are available.64,65
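The multiple-initialization advice can be sketched as below. This is illustrative only (the function name and random seeding scheme are assumptions): run k-means from several random starts and keep the partition with the lowest objective value.

```python
import numpy as np

def best_of_restarts(X, k, n_restarts=10, iters=50, seed=0):
    """Run k-means from several random initializations and keep the
    partition with the lowest objective (sum of squared distances)."""
    rng = np.random.default_rng(seed)
    best_obj, best_labels = np.inf, None
    for _ in range(n_restarts):
        # Seed centers with k distinct randomly chosen examples.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(iters):
            d2 = ((X[:, None] - centers[None]) ** 2).sum(axis=2)   # (n, k)
            labels = d2.argmin(axis=1)
            centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Evaluate the objective for the final centers of this restart.
        d2 = ((X[:, None] - centers[None]) ** 2).sum(axis=2)
        obj = d2.min(axis=1).sum()
        if obj < best_obj:
            best_obj, best_labels = obj, d2.argmin(axis=1)
    return best_obj, best_labels
```

For a maximization-type objective, the comparison is simply reversed.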
There are a lot of approaches that are not discussed here. This includes time series clustering.66-68 Some other notable ones are in Refs 69-74. They all contain some advances that might be helpful for your problem. Happy clustering.
ACKNOWLEDGMENTS
This work was partially supported by grant 1U01CA143062-01, Radiomics of NSCLC, from the National Institutes of Health.
REFERENCES
1. Jain A, Dubes R. Algorithms for Clustering Data. Upper Saddle River, NJ: Prentice-Hall; 1988.
2. Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv 1999, 31:264-323.
3. Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res 2003, 3:993-1022.
4. Dave RN, Krishnapuram R. Robust clustering methods: a unified view. IEEE Trans Fuzzy Syst 1997, 5:270-293.
5. Kim J, Krishnapuram R, Dave R. Application of the least trimmed squares technique to prototype-based clustering. Pattern Recognit Lett 1996, 17:633-641.
6. Wu KL, Yang MS. Alternative c-means clustering algorithms. Pattern Recognit 2002, 35:2267-2278.
7. Banerjee A, Merugu S, Dhillon IS, Ghosh J. Clustering with Bregman divergences. J Mach Learn Res 2005, 6:1705-1749.
8. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Los Angeles, CA: University of California Press; 1967, 281-297.
9. Jain AK. Data clustering: 50 years beyond k-means. Pattern Recognit Lett 2010, 31:651-666. Award winning papers from the 19th International Conference on Pattern Recognition (ICPR).
10. Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Boston, MA; August 20-23, 2000.
11. Redmond SJ, Heneghan C. A method for initialising the k-means clustering algorithm using kd-trees. Pattern Recognit Lett 2007, 28:965-973.
12. He J, Lan M, Tan CL, Sung SY, Low HB. Initialization of cluster refinement algorithms: a review and comparative study. In: 2004 IEEE International Joint Conference on Neural Networks; 2004, 1-4:(xlvii+3302).
13. Hall LO, Ozyurt IB, Bezdek JC. Clustering with a genetically optimized approach. IEEE Trans Evolut Comput 1999, 3:103-112.
14. Selim SZ, Ismail MA. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 1984, PAMI-6(1):81-87.
15. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. San Francisco: Morgan Kaufmann; 2005.
16. Bezdek JC, Keller JM, Krishnapuram R, Kuncheva LI, Pal NR. Will the real iris data please stand up? IEEE Trans Fuzzy Syst 1999, 7:368-369.
17. Kothari R, Pitts D. On finding the number of clusters. Pattern Recognit Lett 1999, 20:405-416.
18. Kandel A. Fuzzy Mathematical Techniques With Applications. Boston, MA: Addison-Wesley; 1986.
19. Wu KL. Analysis of parameter selections for fuzzy c-means. Pattern Recognit 2012, 45:407-415. http://dx.doi.org/10.1016/j.patcog.2011.07.01
20. Yu J, Cheng Q, Huang H. Analysis of the weighting exponent in the FCM. IEEE Trans Syst Man Cybern, Part B: Cybern 2004, 34:634-639.
21. Bezdek J, Hathaway R, Sobin M, Tucker W. Convergence theory for fuzzy c-means: counterexamples and repairs. IEEE Trans Syst Man Cybern 1987, 17:873-877.
22. Gath I, Geva AB. Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 1989, 11:773-780.
23. Gustafson DE, Kessel WC. Fuzzy clustering with a fuzzy covariance matrix. In: Proc IEEE CDC 1979, 761-766.
24. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 1977, 39:1-38.
25. Wu CFJ. On the convergence properties of the EM algorithm. Ann Stat 1983, 11:95-103.
26. Fraley C, Raftery AE. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 1998, 41:579-588.
27. Krishnapuram R, Keller JM. A possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1993, 1:98-110.
28. Krishnapuram R, Keller JM. The possibilistic c-means algorithm: insights and recommendations. IEEE Trans Fuzzy Syst 1996, 4:385-393.
29. Barni M, Cappellini V, Mecocci A. Comments on a possibilistic approach to clustering. IEEE Trans Fuzzy Syst 1996, 4:393-396.
30. Dubois D, Prade H. Possibility theory, probability theory and multiple-valued logics: a clarification. Ann Math Artif Intell 2001, 32:35-66.
31. Sung KK, Poggio T. Example-based learning for view-based human face detection. IEEE Trans Pattern Anal Mach Intell 1998, 20:39-51.
32. Krishnapuram R, Nasraoui O, Frigui H. The fuzzy c spherical shells algorithm: a new approach. IEEE Trans Neural Netw 1992, 3:663-671.
33. Yang MS, Wu KL. Unsupervised possibilistic clustering. Pattern Recognit 2006, 39:5-21.
34. Timm H, Borgelt C, Doring C, Kruse R. An extension to possibilistic fuzzy cluster analysis. Fuzzy Sets Syst 2004, 147:3-16.
35. Ghosh J, Acharya A. Cluster ensembles. WIREs Data Min Knowl Discov 2011, 1:305-315.
36. Strehl A, Ghosh J, Cardie C. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 2002, 3:583-617.
37. Schaeffer SE. Graph clustering. Comput Sci Rev 2007, 1:27-64.
38. Hathaway RJ, Davenport JW, Bezdek JC. Relational duals of the c-means clustering algorithms. Pattern Recognit 1989, 22:205-212.
39. Krishnapuram R, Joshi A, Nasraoui O, Yi L. Low-complexity fuzzy relational clustering algorithms for web mining. IEEE Trans Fuzzy Syst 2001, 9:595-607.
40. Aggarwal C, Hinneburg A, Keim D. On the surprising behavior of distance metrics in high dimensional space. Lecture Notes in Computer Science, Vol. 1973. Berlin/Heidelberg: Springer; 2001, 420-434.
41. Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993, 49:803-821.
42. Hofmann T, Schölkopf B, Smola AJ. Kernel methods in machine learning. Ann Stat 2008, 36:1171-1220.
43. Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998, 2:121-167.
44. Kim DW, Lee KY, Lee DH, Lee KH. Evaluation of the performance of clustering algorithms in kernel-induced feature space. Pattern Recognit 2005, 38:607-611.
45. Heo G, Gader P. An extension of global fuzzy c-means using kernel methods. In: 2010 IEEE International Conference on Fuzzy Systems (FUZZ); 2010, 1-6.
46. Chen L, Chen CLP, Lu M. A multiple-kernel fuzzy c-means algorithm for image segmentation. IEEE Trans Syst Man Cybern, Part B: Cybern 2011, 99:1-12.
47. Bezdek JC, Pal NR. Some new indexes of cluster validity. IEEE Trans Syst Man Cybern, Part B: Cybern 1998, 28:301-315.
48. Wang JS, Chiang JC. A cluster validity measure with outlier detection for support vector clustering. IEEE Trans Syst Man Cybern, Part B: Cybern 2008, 38:78-89.
49. Pal NR, Bezdek JC. On cluster validity for the fuzzy c-means model. IEEE Trans Fuzzy Syst 1995, 3:370-379.
50. Kaufman L, Rousseeuw P. Finding Groups in Data. New York: John Wiley & Sons; 1990.
51. Campello RJGB, Hruschka ER. A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets Syst 2006, 157:2858-2875.
52. Vendramin L, Campello RJGB, Hruschka ER. Relative clustering validity criteria: a comparative overview. Stat Anal Data Min 2010, 3:209-235.
53. Xie XL, Beni G. A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 1991, 13:841-847.
54. Pal NR, Bezdek JC. Correction to "On cluster validity for the fuzzy c-means model". IEEE Trans Fuzzy Syst 1997, 5:152-153.
55. Bezdek J, Hathaway R. VAT: a tool for visual assessment of (cluster) tendency. In: Proceedings of International Joint Conference on Neural Networks 2002, 2225-2230.
56. Bezdek J, Hathaway R, Huband J. Visual assessment of clustering tendency for rectangular dissimilarity matrices. IEEE Trans Fuzzy Syst 2007, 15:890-903.
57. Sedgewick R, Flajolet P. An Introduction to the Analysis of Algorithms. Boston, MA: Addison-Wesley; 1995.
58. Kargupta H, Huang W, Sivakumar K, Johnson E. Distributed clustering using collective principal component analysis. Knowl Inf Syst 2001, 3:422-448.
59. Kriegel HP, Krieger P, Pryakhin A, Schubert M. Effective and efficient distributed model-based clustering. IEEE Int Conf Data Min 2005, 258-265.
60. Olman V, Mao F, Wu H, Xu Y. Parallel clustering algorithm for large data sets with applications in bioinformatics. IEEE/ACM Trans Comput Biol Bioinf 2009, 6:344-352.
61. Bradley PS, Fayyad U, Reina C. Scaling clustering algorithms to large databases. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining 1998, 9-15.
62. Hore P, Hall LO, Goldgof DB. Single pass fuzzy c-means. In: IEEE International Fuzzy Systems Conference, FUZZ-IEEE 2007. IEEE; 2007, 1-7.
63. Hore P, Hall L, Goldgof D, Gu Y, Maudsley A, Darkazanli A. A scalable framework for segmenting magnetic resonance images. J Sig Process Syst 2009, 54:183-203.
64. Eschrich S, Ke J, Hall LO, Goldgof DB. Fast accurate fuzzy clustering through data reduction. IEEE Trans Fuzzy Syst 2003, 11:262-270.
65. Hore P, Hall LO, Goldgof DB, Gu Y. Scalable clustering code. Available at: http://www.csee.usf.edu/hall/scalable. (Accessed April 26, 2012).
66. D'Urso P. Fuzzy clustering for data time arrays with inlier and outlier time trajectories. IEEE Trans Fuzzy Syst 2005, 13:583-604.
67. Coppi R, D'Urso P. Fuzzy unsupervised classification of multivariate time trajectories with the Shannon entropy regularization. Comput Stat Data Anal 2006, 50:1452-1477.
68. Liao TW. Clustering of time series data - a survey. Pattern Recognit 2005, 38:1857-1874.
69. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96. New York: ACM; 1996, 103-114.
70. Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD International Conference on Management of Data; 1998, 73-74.
71. Aggarwal CC, Han J, Wang J, Yu PS. A framework for clustering evolving data streams. In: Proceedings of the International Conference on Very Large Data Bases; 2003.
72. Gupta C, Grossman R. GenIc: a single pass generalized incremental algorithm for clustering. In: Proceedings of the Fourth SIAM International Conference on Data Mining (SDM 04); 2004, 22-24.
73. Dhillon IS, Mallela S, Kumar R. A divisive information theoretic feature clustering algorithm for text classification. J Mach Learn Res 2003, 3:1265-1287.
74. Linde Y, Buzo A, Gray R. An algorithm for vector quantizer design. IEEE Trans Commun 1980, 28:84-95.