Sie sind auf Seite 1von 14

This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CYBERNETICS 1

A Hybrid Approach to Clustering in Big Data


Dheeraj Kumar, James C. Bezdek, Life Fellow, IEEE, Marimuthu Palaniswami, Fellow, IEEE,
Sutharshan Rajasegarar, Christopher Leckie, and Timothy Craig Havens

Abstract—Clustering of big data has received much attention vector of attributes, data clustering is performed on feature
recently. In this paper, we present a new clusiVAT algorithm and vectors xi ∈ Rp , where xi is the p-dimensional feature vec-
compare it with four other popular data clustering algorithms. tor for oi , 1 ≤ i ≤ n. These data can also be represented
Three of the four comparison methods are based on the well
known, classical batch k-means model. Specifically, we use in the form of an n × n dissimilarity matrix D, where Dij
k-means, single pass k-means, online k-means, and cluster- represents dissimilarity (distance) between oi and oj . Usually
ing using representatives (CURE) for numerical comparisons. the Euclidean distance ||xi − xj || is taken as the dissimilarity
clusiVAT is based on sampling the data, imaging the reordered measure, but it can be any norm on Rp .
distance matrix to estimate the number of clusters in the data Many papers and books describe different data cluster-
visually, clustering the samples using a relative of single link-
age (SL), and then noniteratively extending the labels to the ing approaches and their applications [1]–[6]. Among the
rest of the data-set using the nearest prototype rule. Previous large number of clustering approaches in the literature, the
work has established that clusiVAT produces true SL clusters largest two groups are based on hierarchical clustering and
in compact-separated data. We have performed experiments to centroid-based clustering. Hierarchical clustering relies on
show that k-means and its modified algorithms suffer from ini- the fact that nearby objects have a higher probability of
tialization issues that cause many failures. On the other hand,
clusiVAT needs no initialization, and almost always finds par- belonging to the same cluster than to a cluster containing
titions that accurately match ground truth labels in labeled objects that are farther away. This category includes single
data. CURE also finds SL type partitions but is much slower linkage (SL), which is based on cutting large edges in a min-
than the other four algorithms. In our experiments, clusiVAT imum spanning tree (MST) [7]. In this paper, we discuss
proves to be the fastest and most accurate of the five algo- two connectivity-based algorithms, clusiVAT and clustering
rithms; e.g., it recovers 97% of the ground truth labels in the
real world KDD-99 cup data (4 292 637 samples in 41 dimensions) using representatives (CURE). Centroid-based clustering algo-
in 76 s. rithms represent clusters as groups located in close proximity
to their cluster centers. Most centroid-based models depend
Index Terms—Big data cluster analysis, cluster tendency
assessment, data analytics, Internet of things, single linkage. on optimizing an objective function, which typically measures
a property such as: 1) intercluster separation; 2) within-cluster
variance; or 3) both.
I. I NTRODUCTION Technologies such as social media, mobile computing, and
ATA clustering is the problem of partitioning a set of
D unlabeled objects O = {o1 , o2 , . . . , on } into k groups
of similar objects, where 1 < k < n. Before clusters can
the realization of the Internet of Things (IoT) generate an exor-
bitant amount of data every day, which comprise the big data
problem [8]–[11]. Big data approaches currently consider one
be sought, it is necessary to estimate k; this is the clus- or more aspects of the so called 5Vs (volume, velocity, variety,
ter tendency problem. When each object is represented by a value, and veracity) [12]. This paper concentrates on the vol-
ume aspect of big data, which requires novel techniques to be
Manuscript received April 29, 2014; revised January 21, 2015 and
June 9, 2015; accepted August 29, 2015. This work was supported by addressed by conventional data clustering algorithms.
the Australian Research Council (ARC) Research Network on Intelligent Our main contributions in this paper are as follows.
Sensors, Sensor Networks and Information Processing under REDUCE 1) We present our new clusiVAT algorithm for big data
Project Grant EP/I000232/1 through the Digital Economy Programme run by
Research Councils U.K.—a cross council initiative led by EPSRC and con- clustering and perform experiments to compare its
tributed to by Arts and Humanities Research Council, Economic and Social performance with other popular big data clustering
Research Council, and Medical Research Council; and ARC under Grant algorithms: k-means, single pass k-means (spkm), online
LP120100529 and Grant LF120100129. This paper was recommended by
Associate Editor F. Karray. k-means (okm), and CURE.
D. Kumar, M. Palaniswami, and S. Rajasegarar are with the Department 2) We perform experiments on 24 2-D datasets having
of Electrical and Electronic Engineering, University of Melbourne, Gaussian clusters of up to 1 000 000 samples, nine
Melbourne, VIC 3010, Australia (e-mail: dheerajk@student.unimelb.edu.au;
palani@unimelb.edu.au; sraja@unimelb.edu.au). high-dimensional sets of Gaussian clusters (having a
J. C. Bezdek and C. Leckie are with the Department of Computing maximum of 500 000 500-dimensional datapoints), and
and Information Systems, University of Melbourne, Melbourne, VIC 3010, two real-life datasets (the largest of which has 4 292 637
Australia (e-mail: jcbezdek@gmail.com; caleckie@unimelb.edu.au).
T. C. Havens is with the Department of Electrical and Computer vectors with 41 features each), to show the useful-
Engineering, Michigan Technological University, Houghton, MI 49931, USA ness of clusiVAT in terms of CPU time and partition
(e-mail: thavens@mtu.edu). accuracy (PA).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org. 3) To illustrate the utility of clusiVAT for unlabeled data,
Digital Object Identifier 10.1109/TCYB.2015.2477416 we perform clustering experiments on indoor office
2168-2267 c 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2 IEEE TRANSACTIONS ON CYBERNETICS

environment energy usage data from the University Ball and Hall [24] proposed a k-means based method named
of Surrey, U.K. [13]–[15]. While clusiVAT is able to ISODATA for data analysis and pattern classification in 1965.
suggest the number of clusters in the dataset, other These algorithms are all batch methods that attempt to min-
algorithms must rely on intuition or guesses or need imize a global objective function. Unfortunately, there is a
to use a clustering tendency algorithm for an esti- sequential version of this model that is essentially a com-
mate of k. clusiVAT partitions have the largest value petitive learning algorithm that is also called k-means. This
of Dunn’s index (DI) amongst candidates found by the sequential version was first proposed by MacQueen [25]
five algorithms. in 1967. Sequential k-means is an application of vector quanti-
4) We apply the Friedman test to show the statistical signif- zation which tries to find k means in the dataset for k clusters,
icance of the accuracy ranking for the various clustering where each data point belongs to the cluster with nearest mean.
algorithms examined in this paper. In this paper, “k-means” refers to the batch version.
The rest of this paper is structured as follows. Section II The k-means algorithm is easy to implement and is compu-
contains related work. Our new clusiVAT model is discussed tationally efficient, but it has various limitations. For example,
in Section III. Section IV gives brief descriptions of k-means the number of clusters is an input for k-means, which is usu-
and two big data relatives, spkm and okm. CURE is reviewed ally not known. More worrisome is the fact that k-means often
in Section V and Section VI gives a overview of the Friedman gets stuck at a local trap state of its objective function, which
test. In Section VII, we discuss the computational complexity may lead to incorrect cluster interpretations. This problem is
of the clusiVAT algorithm. Section VIII contains numer- usually ascribed to poor initialization. Another limitation of
ical comparisons using real and synthetic datasets before k-means is that its distance-based model for identifying good
summarizing in Section IX. clusters depends on the topology of the norm used in its objec-
tive function. The usual model uses an inner product norm
whose topology matches well with elliptically shaped clus-
II. R ELATED W ORK ters. Furthermore, k-means tries to impose the same shape on
Data clustering is primarily concerned with separating all k clusters. Thus, in some sense k-means and SL work well
objects into k different groups, which presupposes one impor- for data distributions at geometrically opposite extremes.
tant preclustering task, namely, estimating the number of A large number of algorithms based on both SL and k-means
clusters in the data (clustering tendency). The visual assess- have been proposed for the big data clustering problem. To
ment of tendency (VAT) algorithm [16] addresses the question the best of our knowledge, the first scalable SL-based algo-
of clustering tendency by reordering the dissimilarity matrix D rithm was proposed in [26], where it was called scalable-VAT
to obtain D∗ so that different clusters may be displayed as dark (sVAT)-SL. The clusiVAT model and algorithm proposed in
blocks along the diagonal of the image of D∗ . this paper are extensions of the ideas presented in [26].
SL proceeds by connecting the next nearest vertex to the Another scalable relative of sVAT-SL was discussed and com-
current edge until the complete MST is formed. k clusters pared to a fast MST algorithm called filter-Kruskal in [27].
are then formed by cutting the largest k − 1 edges of As for the big data versions of k-means, a hierarchical ver-
the MST. SL performs best if the clusters are long, chain-like sion that divides the data into two parts at each step before
clouds, well separated from each other. As cluster separation clustering, named bisecting k-means, was proposed in [28].
decreases and the clusters in the data start merging with each A fast, scalable version of k-means was presented in [29],
other, SL becomes unreliable. Nonetheless, SL has been suc- which does not require all the data to be stored in main mem-
cessfully used in many data clustering applications. In the ory at the same time. A fuzzy algorithm based on k-means for
field of astronomy, dark matter halos were discovered by big data was proposed in [30]. Eschrich et al. [31] replaced
Lacey and Cole [17] using SL. In the field of wireless sen- group points with the group centroid to speed up a fuzzy ver-
sor networks, Moshtaghi et al. [18] used SL for anomaly sion of k-means for big data. Feldman et al. [32] used coresets
detection. Dendrograms, which are visual representations of to approximate a large number of datapoints from big data by
linkage clusters, are used in many numerical taxonomy appli- a single point. In this paper, we have used two big data adap-
cations [19]. In the field of healthcare, SL has been used to tations of k-means namely, spkm, and okm, which split the
segment time-series sensor data for patient monitoring at elder- big dataset into small chunks of data before clustering for
care facilities [20]. Zhang et al. [21] discussed a commercial faster run time. An application of k-means based clustering is
application of VAT for role-based access security. presented in [33].
The k-means algorithm is one of the most popular clustering CURE [34] is a sample-based algorithm for large datasets,
algorithms, mainly because of its simplicity and applicabil- which performs clustering on a small subset, and then extends
ity in various fields and is used extensively in literature as a the results to the entire dataset. It is able to identify clus-
benchmark for clustering algorithms. k-means was developed ters having nonspherical shapes and is robust to outliers.
independently in different scientific fields. For continuous mul- CURE seeks a middle ground between SL and k-means by
tidimensional data, k-means was first explicitly proposed by initializing clusters in the sample, and then, akin to SL, merg-
Steinhaus [22] in 1956 in the field of mechanics. In the field ing nearest clusters until the desired value of k is attained.
of communication, Lloyd [23] proposed k-means for least Since CURE combines elements of linkage-based and central
squares quantization in pulse code modulation in 1957 as a tendency-based clustering methods, it is in some sense an ideal
Bell Laboratory technical note (it was later published in 1982). comparison method for the experiments presented later.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

KUMAR et al.: A HYBRID APPROACH TO CLUSTERING IN BIG DATA 3

Algorithm 1: VAT [16] Algorithm 3: siVAT [43], [44]


Input : D − n × n dissimilarity matrix satisfying Input : X = {x1 , x2 , . . . , xN } − N p-dimensional data
– Dij ≥ 0 points
– Dij = Dji ∀ i, j k − overestimate of actual number of clusters
– Dii = 0 ∀ i n − approximating sample size

Output: D∗ − n × n VAT reordered dissimilarity matrix Output: Dn∗ − n × n iVAT reordered dissimilarity matrix
P − VAT reordering indices of D of Dn
d − Ordering of MST cut magnitudes 
S − indices of samples in Dn
Set K = {1, 2, . . . , n}, I = J = ∅ P − VAT reordering indices of Dn
Select (i, j) ∈ arg max Dkq d − Ordering of MST cut magnitudes
k∈K,q∈K
P1 = i; I = {i} and J = K − {i} Select the indices m of k distinguished objects
for t ← 2 to n do m1 = 1
Select (i, j) ∈ arg min Dkq y = {dist{x1 , x1 }, . . . , dist{x1 , xN }}
k∈I,q∈J for t ← 2 to k do
Pt = j; I = I ∪ {j}; J = J − {j}; dt−1 = Dij y = (min{y1 , dist{xmt−1 , x1 }}, . . . ,
end min{yN , dist{xmt−1 , xN }})
for p ← 1 to n do mt =arg max {yj }
for q ← 1 to n do 1≤j≤N
D∗p,q = DPp ,Pq end
end Group objects in X = {x1 , x2 , . . . , xN } with their
end nearest distinguished objects
S1 = S2 = · · · = Sk = ∅
for t ← 1 to N do
Algorithm 2: iVAT [43] l = arg min {dist{xmj , xt }}
1≤j≤k
Input : D∗ − n × n VAT reordered dissimilarity matrix Sl = Sl ∪ {t}
Output: D ∗ − n × n iVAT dissimilarity matrix end
for r ← 2 to n do Randomly select data near each distinguished object
j = arg min {D∗rk } to form Dn
1≤k≤r−1

Drj∗ = D∗rj for t ← 1 to k do
t|
c = {1, 2, . . . , r − 1} − {j} nt = n×|S
N

Drc∗ = max{D∗rj , Djc∗ } Draw nt unique random indices St from St
end
end 
S = kt=1 St ; Dn = dist{xS , xS }

Drc∗ = Dcr∗
Apply VAT to Dn returning D∗n , P and d

Apply iVAT to D∗n returning Dn∗
Statistical evaluation of experimental results has been con-
sidered an essential part of validating new machine learning
methods and for comparison of various clustering algo- for vast research in this area. Other important contributions in
rithms [35], [36]. To compare the effectiveness of different this field are described in [41].
clustering algorithms over multiple datasets, we have used the Our approach for fast clustering in big data finds its roots in
Friedman two-way analysis of variances by ranks test [37], the VAT algorithm, which reorders the input distance matrix
which is a nonparametric equivalent of the repeated measures D to obtain D∗ using a modified Prim’s algorithm. The image
ANOVA test [38]. This test provides an insight into whether I(D∗ ), when displayed as a gray-scale image, shows possible
different clustering algorithms have a performance ranking on clusters as dark blocks along the diagonal. Pseudocode for
a range of datasets. VAT is given in Algorithm 1.
While VAT can often provide a useful estimate of the
number of clusters in a dataset, a much sharper reordered diag-
III. CLUSI VAT A LGORITHM onal matrix image can usually be obtained using improved
The clusiVAT algorithm is based on reordered dissimilar- VAT (iVAT) as described in [42] and [43] (Algorithm 2).
ity images (RDIs), also called as “cluster heat maps.” This iVAT provides better images by replacing input distances in
approach to data clustering and pattern finding has been used the distance matrix D = [dij ] by geodesic distances D = [dij ],
since the late 19th century. In 1899, Petrie [39] used visual given by
reordering to find trends in measurement data. The first use dij = min max Dp[h]p[h+1] (1)
of an RDI based on matrix permutation was demonstrated by p∈Pij 1≤h≤|p|

Czekanowski [40] in 1909. Though it was a manual approach where Pij is the set of all paths from object i (oi ) to object j (oj )
on very small dataset of only 13 samples, it opened the way in O. We use the recursive version of iVAT presented in [43] as
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4 IEEE TRANSACTIONS ON CYBERNETICS

(a) (b) (c) (d)

(e) (f) (g) (h)

Fig. 1. Data scatterplot, VAT, sVAT, and siVAT images for small Gaussian clusters (top) and big Gaussian clusters (bottom). (a) Dataset N = 5000.
(b) VAT for N = 5000. (c) sVAT for n = 500. (d) siVAT for n = 500. (e) Dataset N = 1 000 000. (f) VAT for N = 1 000 000. (g) sVAT for n = 500.
(h) siVAT for n = 500.

it has time complexity of O(n2 ) as compared to O(n3 ) for the Algorithm 4: clusiVAT
iterative construction of D ∗ in [42]. Importantly, the theory Input : X = {x1 , x2 , . . . , xN } − N p-dimensional data
that connects SL to VAT also holds for recursive iVAT, which points
preserves VAT order. k − overestimate of actual number of clusters
Though VAT and iVAT work fine for small datasets, n − approximating sample size

they both suffer from resolution and memory constraints Output: Dn∗ − n × n iVAT reordered dissimilarity matrix
that limit their usefulness to input matrices sized on the of Dn
order of 105 or so. To overcome this limitation, scalable- u : X → {1, 2, . . . , k} − cluster membership
VAT [sVAT (Algorithm 3)] was introduced in [44], which
Apply siVAT on X returning Dn∗ ,  S, P, d
works by sampling the big dataset and then constructing a
Choose the number of clusters k using siVAT image
VAT or iVAT image of the sample. sVAT finds a small Dn
t = arg max di
distance matrix (having size n × n) of a subset of the big 1≤i≤k
data, X = {x1 , x2 , . . . , xN }, where n is a “VAT-sized” fraction Form the aligned partition:
of N. siVAT is just like sVAT, except it uses iVAT after the u∗ = {t1 : t2 − t1 : ... : tk − tk−1 }
sampling step. uSP = u∗Pi ; 1 ≤ i ≤ k
i
To illustrate VAT, sVAT, and siVAT consider Fig. 1. Fig. 1(a) for x̂ ∈ X̂ = X − XS do
is the scatterplot of 5000 2-D data points randomly drawn j = arg min{dist{xŝ , xi }}
from a four-component Gaussian mixture having equal prior i∈
S
uŝ = uj (NPR)
probabilities. Its VAT image is shown in Fig. 1(b). Fig. 1(c) end
shows the sVAT image of 500 samples (10% of the total
dataset) which was made in about 0.1% of the time taken
to compute the full VAT image. Fig. 1(b) and (c) both
suggests the presence of four clusters in the data by four The only role played by the MST built by Prim’s algo-
dark blocks along the diagonal, but these dark blocks are rithm in VAT and iVAT is to acquire the array of indices as
much clearer in the siVAT image of the n = 500 sampled edges joining the MST. This array is used in the reordering
points [Fig. 1(d)]. To illustrate the extension of this idea to operation. Now suppose that one of these images suggests that
big data, Fig. 1(e) is a scatterplot of N = 1 000 000 2-D the best guess for the number of clusters in D is k. Having this
points drawn from the same 4 component Gaussian, with estimate, we cut the k − 1 largest edges in the MST, resulting
250 000 points per cluster. In this case, we cannot gener- in k connected subtrees (the clusters). The essential step in
ate a VAT image, indicated by the question mark (?) in clusiVAT is to extend this k-partition of Dn noniteratively to
Fig. 1(f). However, we can generate sVAT and siVAT images the unlabeled objects in X using the nearest (object) prototype
for this big dataset by sampling n = 500 points (0.05% of the rule (NPR). Pseudocode for our new clusiVAT algorithm is
total dataset) from DN . The sVAT image [Fig. 1(g)] suggests given in Algorithm 4.
four clusters, which again are much sharper in the siVAT Next we give a thumbnail sketch of the theory that connects
image [Fig. 1(h)]. SL to clusiVAT. Consider a partition U of a set of n feature
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

KUMAR et al.: A HYBRID APPROACH TO CLUSTERING IN BIG DATA 5

 X having k clusters U ∼ {X1 , X2 , . . . , Xk }, so that


vectors an iterative refinement technique as described in [23]. This
X = ki=1 Xi . Define the diameter (Xi ) of cluster Xi as implementation often terminates near a local minimum, but
its performance is very dependent on the initialization used.
(Xi ) = max{d(x, y) : x, y ∈ Xi } (2)
Due to lack of space we omit the pseudocode for the k-means
where d(x, y) is the distance between x and y. Let the (set) algorithm in this paper.
distance between two clusters Xi and Xj be In the next two sections, we describe two variants of
    k-means for big data clustering. Both of these variants use
δ Xi , Xj = min d(x, y) : x ∈ Xi , y ∈ Xj . (3) weighted k-means (wkm) for their implementation. wkm is
Dunn [45] defines the separation index, also known as DI, just like k-means except each data points carries an associ-
for partition U as ated weight. Let {w1 , w2 , . . . , wN } be user specified positive
  weights, then the modified mean, mw,i of cluster Xi is
min1≤i,j≤k δ Xi , Xj given by
i=j
VD (U, X) = . (4) 
max1≤i≤k (Xi )
xj ∈Xi wj xj
The partition {X1 , X2 , . . . , Xk } is a compact separated (CS) mw,i =  . (8)
xj ∈Xi wj
k-partition if DI VD (U, X) > 1. Havens et al. [46] discussed
the relation between SL clusters, DI and partitions. They The weighted sum of squared errors for the k clusters is
showed that DI and the ability of VAT to display cluster ten- given by
dency have a direct theoretical relationship. sVAT-SL described
in [26] produces exact SL clusters for CS datasets. Since the
k

recursive version of iVAT [43] that we use in clusiVAT pre- Jw (U) = wj ||xj − mw,i ||2 . (9)
serves VAT ordering, the same clustering principle applies to i=1 xj ∈Xi
clusiVAT as well.
wkm attempts to minimize the overall weighted
For datasets having DI VD (U, X) < 1, SL clusters cannot be
(within groups) sum of squared errors. wkm is used in
guaranteed by clusiVAT. For such datasets, clusiVAT becomes
the next two sections, and takes as input n p-dimensional
a novel clustering algorithm, that in our experience to date
datapoints {x1 , x2 , . . . , xn }, their corresponding weights
produces much better clusters than SL [27]. Very few datasets
{w1 , w2 , . . . , wn }, and the number of clusters k. An alternative
have the CS property, but clusiVAT can be used for arbitrarily
initialization is to specify initial centroids {m1 , m2 , . . . , mk }.
large datasets (whether CS or not) and, as we shall demonstrate
If initial centroids are not provided, we randomly select
in our comparison experiments, it produces highly accurate
k input points as initial centroids. The outputs returned by
clusters in less time than several well known k-means related
wkm are a set of k clusters U ∼ {X1 , X2 , . . . , Xk }, their cen-
alternatives for clustering big data.
troids M = {mw,1 , mw,2 , . . . , mw,k } and cluster membership
function u : X → {1, 2, . . . , k}.
IV. k-M EANS AND R ELATED A LGORITHMS
Consider N p-dimensional points, X = {x1 , x2 , . . . , xN }
A. Single Pass k-Means
to be clustered into k clusters. Let the set of clusters be
U ∼ {X1 , X2 , . . . , Xk }. k-means seeks a partition having an spkm (Algorithm 5) is a crisp adaptation of the single pass
overall minimum squared error between the sample means of fuzzy c-means algorithm discussed in [48]. Let N be the
the clusters and the points in the clusters. The mean mi of number of points in a dataset, which are unloadable in main
cluster Xi is given by memory; let n denote the number of points that can be loaded
 into memory; and let s = (N/n). spkm divides the N points
xj ∈Xi xj in the big dataset into s chunks of small data. A portion of
mi = (5)
|Xi | the data is loaded into memory and k-means is applied to
where |Xi | is the number of data points in cluster Xi . obtain k clusters. This first set of input data is replaced by the
The squared error for cluster Xi is defined as k weighted means {mw,i : 1 ≤ i ≤ k}, where the weights are
the numbers of points in each cluster, {wi = |Xi | : 1 ≤ i ≤ k}.
J(Xi ) = ||xj − mi ||2 . (6) These k weighted centroids are then merged with the next data
xj ∈Xi chunk and wkm is applied to this merged set with the cen-
The norm ||*|| can be any vector norm on Rp . The usual troids from the previous k-means run taken as initial centroids.
choice is the Euclidean norm, and that is what we use in our This process is repeated until the whole dataset is loaded and
numerical experiments. k-means aims at minimizing the sum processed. After obtaining the final k centroids, the big data
of squared errors for all k clusters, defined as are labeled based on the NPR as shown in Algorithm 5.
spkm can be used with arbitrarily large input data. The two

k
k
most important disclaimers about its effectiveness are that:
J(U) = J(Xi ) = ||xj − mi ||2 . (7)
1) each of the s steps in this procedure is subject to the
i=1 i=1 xj ∈Xi
same limitations and problems as the parent (k-means) and
Minimizing the squared error is an NP-hard problem [47]. 2) the output is clearly dependent on the way the big data
We have used the standard k-means algorithm which uses are divided into s chunks. This becomes clear if you interpret
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON CYBERNETICS

Algorithm 5: spkm [48] Algorithm 6: okm [49]


Input : X = {x1 , x2 , . . . , xN } − N p-dimensional data Input : X = {x1 , x2 , . . . , xN } − N p-dimensional data
points points
n − number of data points that can be loaded in n − number of data points that can be loaded in
main memory main memory
k − desired number of clusters k − desired number of clusters
Output: U ∼ {X1 , X2 , . . . , Xk } − set of k clusters Output: U ∼ {X1 , X2 , . . . , Xk } − set of k clusters
M = {mw,1 , mw,2 , . . . , mw,k } − (weighted) M = {mw,1 , mw,2 , . . . , mw,k } − (weighted)
centroids of the {Xj } centroids of the {Xj }
u : X → {1, 2, . . . , k} − cluster membership u : X → {1, 2, . . . , k} − cluster membership
Load X as n sized randomly chosen subsets Load X as n sized subsets
s =  Nn ; X = {X1 , X2 , . . . , Xs } s =  Nn 
w = 1n ; (U, M, u) = wkm(X1 , w, k) X = {X1 , X2 , . . . , Xs }, where
for l ← 2 to s do Xi = {x(i−1)n+1 , x(i−1)n+2 , . . . , xin }
for i ← 1to k do (U1 , M1 , u1 ) = wkm(X1 , 1n , k)
w i = xj ∈Xi wj w1 = {|X11 |, |X12 |, . . . , |X1k |}
end for l ← 2 to s do
w = {w 1 , w 2 , . . . , w k , 1n } (Ul , Ml , ul ) = wkm(Xl , 1n , k, Ml−1 )
(U, M, u) = wkm({M ∪ Xl }, w, k, M) wl = {|Xl1 |, |Xl2 |, . . . , |Xlk |}
end end
for each xi ∈ X do w = {w1 , w2 , . . . , ws }
u(xi ) = arg min dist(xi , mw,j ) (NPR) (U, M, u) = wkm({M1 , M2 , . . . , Ms }, w, k)
1≤j≤k
for each xi ∈ X do
end u(xi ) = arg min dist(xi , mw,j ) (NPR)
for 1 ≤ i ≤ k do 1≤j≤k
Xi = {xj |u(xj ) = i, 1 ≤ j ≤ N} end
end for 1 ≤ i ≤ k do
Xi = {xj |u(xj ) = i, 1 ≤ j ≤ N}
end

each chunk of spkm as one input. Thus, spkm is (chunk by


chunk) sequential, and so, at the mercy of the input ordering
of the s chunks. effect of outliers. CURE uses two types of data structures,
k-d trees [50] and heaps [51]. Each cluster is a unique entry
B. Online k-Means in the heap and has the information regarding the number
of points, representative points, and mean of all the points
okm (Algorithm 6) is a crisp adaptation of the online fuzzy
in the cluster and is arranged in increasing order of the
c-means algorithm discussed in [49]. okm is designed to
distance from its nearest cluster. The k-d tree stores the rep-
overcome the dependence of spkm on chunk selection and
resentative points of all the clusters and can find the nearest
ordering. This model also divides the big data into s chunks
neighbor of each point in constant time. Each representative
of small data that can be loaded in the system memory. okm
point forms a separate cluster in the beginning. The nearest
then performs k-means on each chunk to obtain (s sets of)
clusters are merged to form a big cluster until the prespeci-
k centroids and their weights. The ks centroids obtained
fied number of clusters are obtained. The distance between
from the s chunks are then clustered using wkm, where the
two clusters is the minimum distance between the repre-
weights are the cardinalities of the clusters, to obtain the final
sentative points of the two clusters. Pseudocode for CURE
k centroids. Then, as in spkm, the big data are labeled using
is well documented [34], so we are not reproducing it in
the NPR.
this paper.

V. CURE A LGORITHM
CURE [34] is a hierarchical clustering algorithm that ran- VI. F RIEDMAN T EST
domly samples a fixed number of points from the large The Friedman test [37] ranks the algorithms for each dataset
dataset so that the representative points (hopefully) retain separately, so that the best performing algorithm gets rank 1,
the geometry of the entire dataset. Let the sampled dataset the second best rank 2, and so on. In the case of ties, average
j
be S, and assume that it contains k clusters. Each clus- ranks are assigned to each one of them. Let ri be the rank
ter is represented by a fixed number χ of well-scattered of the jth of the A algorithms on the ith of the B datasets.
points called representative points, which are shrunken toward The average rank of the jth algorithm for all the datasets is
j
the mean of the cluster by a fraction α to have a com- then given by Rj = (1/B) Bi=1 ri . Under the null hypothesis,
pact representation of the cluster as well as to minimize the which states that all the algorithms behave similarly and thus
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

KUMAR et al.: A HYBRID APPROACH TO CLUSTERING IN BIG DATA 7

(a) (b) (c)

(d) (e) (f)

Fig. 2. Distinguished object and random sample selection for clusiVAT experiment of a big non-CS dataset having N = 1 000 000 and k = 10 [PA = 99.92%
(left and right tip of the yellow cluster and bottom of the green cluster are the errors)]. (a) Ground truth scatter plot. (b) Random samples from each partition.
(c) siVAT image of samples. clusiVAT (d) MST image of 100 point sample, (e) partition image of sample points, and (f) partition image of entire dataset.

their ranks Rj should be equal, the Friedman statistic To illustrate the above point, consider the 2-D non-CS
⎡ ⎤ big dataset shown in Fig. 2(a). It consists of k = 10 clus-
12B ⎣ A
A(A + 1) ⎦
2 ters comprising 1 000 000 points, which are intermixed with
χF2 = R2j − (10)
A(A + 1) 4 each other and hence difficult to cluster for any algorithm.
j=1
In this experiment we have taken k = 20 and n = 100.
is distributed according to χF2 with A − 1 degrees of freedom. Fig. 2(a) also shows the 20 distinguished objects found using
Iman and Davenport [52] showed that Friedman’s χF2 the sampling procedure of the clusiVAT algorithm (shown
presents a conservative behavior and proposed a better statistic by bold black dots) and their corresponding partition of the
dataset (shown by solid black lines). Fig. 2(b) shows 100
(B − 1)χF2 randomly chosen samples, to which the iVAT algorithm is
FF = (11)
B(A − 1)χF2 applied. Different clusters are more clearly visible in Fig. 2(b),
and hence easy to cluster. Fig. 2(c) shows the siVAT image,
which is distributed according to the F-distribution with A − 1
showing the possibility of four clusters at a coarse level
and (A − 1) × (B − 1) degrees of freedom.
[if you see Fig. 2(b) from a distance, you see four clusters
at four corners of the frame] and finer level examination of
VII. C OMPUTATIONAL C OMPLEXITY Fig. 2(c) reveals the presence of ten clusters. Fig. 2(d) shows
In this section, we discuss the computational complexity and the MST generated using clusiVAT. Its largest nine edges are
PA of the clusiVAT algorithm. Consider a dataset X having shown in green, which would be cut to obtain ten clusters
N p-dimensional datapoints. The first step in clusiVAT is the of the sampled data as shown in Fig. 2(e). Finally, Fig. 2(f)
selection of k distinguished objects which are at a maximum shows the partition image for the entire dataset generated
distance from each other. This step divides the entire dataset using the NPR.
into k partitions which (on average) span almost equally sized This example shows that clusiVAT can find a high qual-
subspaces of Rp . This step has time complexity linear in k . ity partition using a very small subsample of the dataset
The next step in clusiVAT is to randomly select objects from and the sampling process ensures that we get a fairly good
the k partitions to get a total of n samples. The number of representation of the entire dataset with a small number of
objects selected from each partition is proportional to the num- sample points. k-means and related algorithms process the
ber of datapoints in that partition. These n samples, which entire dataset either all at once (for k-means) or in parts
are just a small fraction of N, retain the approximate geom- (spkm and okm). For CURE, the initial sampling step ran-
etry of the dataset. In the next step, VAT is applied to the n domly selects a fixed number of samples from the dataset,
samples, which (including construction of Dn from X) has a which does not guarantee that the samples retain the geom-
time complexity of O(n2 ). So the N × N distance matrix for etry of the entire dataset. If the number of clusters is small,
the big dataset (DN ) is never needed, but just the n × n dis- the probability of CURE samples retaining the data geometry
tance matrix of the sampled dataset (Dn ). The time reported is high, but as the numbers of clusters increases, the accuracy
for all the experiments in this paper includes the time taken of CURE decreases because the samples do not always retain
to calculate this small distance matrix of the sampled points. the geometry of the entire dataset.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON CYBERNETICS

VIII. N UMERICAL E XPERIMENTS


To compare clusiVAT with other clustering algorithms
described in this paper, we performed experiments on syn-
thetic as well as real datasets. For all our experiments, except
for the REDUCE energy data experiments (Example 6), we
used labeled ground truth UGT to assess the quality of par- (a) (b) (c)
titions obtained by various clustering algorithms, hence the
Fig. 3. siVAT images of n = 10×k point sample for N = 1 000 000 datasets.
algorithms are performing classification task via a clustering (a) k = 3. (b) k = 4. (c) k = 5.
process with predefined class number. The PA of a cluster-
ing algorithm is given as the ratio of the number of samples
correctly labeled to the total number of samples in the dataset size N = 1 000 000 and having 3–5 clusters in Fig. 3(a)–(c),
#(Correctly labeled samples) respectively. The number of dark blocks along the diagonal in
PA = . (12) Fig. 3 shows the ability of clusiVAT to correctly predict the
#(Total samples)
number of clusters in a big dataset.
When the returned labels exactly match the ground truth PA and CPU times (25 run averages) of all the algorithms
partition, we get a PA of 1; when none of the labels are listed in Table I. The maximum PA and minimum time
match, PA = 0. For the REDUCE energy data experiment for each dataset are shown in bold. As evident from Table I,
(Example 6), the dataset is not ground truth labeled, so we do clusiVAT averages 100% PA, followed closely by CURE, and
not know UGT . For this dataset we have compared the DI (4) then spkm. Average run times over the 25 trials on the biggest
of the partitions obtained by different clustering algorithms. dataset (k = 5 clusters, N = 1 000 000) from Table I are:
Clustering schemes that produce partitions with a high DI, clusiVAT (0.78 s); k-means (11.29 s); spkm (8.76 s); okm
especially a DI > 1, are better because this indicates that the (8.85 s); and CURE (46.84 s). Thus, clusiVAT is roughly 12
clusters are farther apart relative to their size. k-PA (DI for times faster than the three k-means methods, and about 60
Example 6) and CPU times are used to compare clusiVAT with times faster than CURE on the N = 1 000 000, k = 5 data set.
outputs obtained from the other four algorithms. The Friedman The last row of Table I gives column averages, which can be
test is later performed on the PA results for Examples 1–5 used to get a better idea of comparative times and accuracies of
and DI results for Example 6 to validate the performance the five algorithms over the 12 big CS datasets. To summarize
ranking of various clustering algorithms. All programs are Table I, clusiVAT and CURE achieves very high (100% and
written in MATLAB 2012. The computational platform is OS: 99.691%, respectively) accuracy over all 12 × 25 = 300 runs,
Windows 7 (64 bit); processor: Intel Core i7-2600 @3.40 GHz; but are at opposite ends of the CPU time spectrum. clusiVAT
RAM: 8 GB. All the algorithm implementations are optimized takes an average time of 0.29 s for all 300 runs, whereas
for multithread operations to utilize all the available threads CURE needs an average time of 21.02 s. The three k-means
in the i7-2600 processor. algorithms fall in-between these extremes in both time and
For all experiments, except Example 6, we set clusiVAT accuracy. Note that spkm is considerably more accurate than
parameters k = 2 × k and n = 10 × k, where k is the number k-means or okm. Over all 300 runs represented in Table I,
of labeled subsets in the ground truth partition of the data. clusiVAT runs in the least time with the highest accuracy.
For spkm and okm, n is set to be 10% of N, the total number
of datapoints. The CURE algorithm is applied on n = 10 × k B. Example 2 (2-D Non-CS Big Data Experiment)
randomly chosen data points from the entire dataset, with
χ = 5 and α = 0.7. For Example 6, since we do not know k, This example uses the same options as Example 1 for
we have used k = 10 and n = 300. The number of sam- dataset size (N = 100 000, 200 000, 500 000, and 1 000 000)
ple points for clusiVAT and CURE are taken to be equal and the number of clusters (k = 3, 4, and 5). The differ-
to afford a fair comparison of clustering accuracy and time. ence between these sets of 2-D Gaussian clusters and those in
k-means, spkm, and okm use the entire dataset for their oper- Example 1 is that the parameters of the mixture are adjusted so
ation, hence their time and accuracy does not vary too much that DI for the ground truth partition UGT is less than 1. This
with k. All the experiments were performed 25 times on each does not guarantee that there is not another partition of these
dataset and the average results are reported. sets for which DI is greater than 1—i.e., we cannot say that
the datasets are not CS, we can only assert that it is highly
unlikely. Thus, we will call these datasets non-CS with the
A. Example 1 (2-D CS Big Data Experiment) understanding that we really mean “probably not CS.” PA and
This example uses 12 big 2-D datasets with size (N = CPU times (25 run averages) of all the algorithms are listed
100 000, 200 000, 500 000, and 1 000 000). Each dataset has in Table II. Since the datasets are non-CS, clusiVAT does not
three options for the number of clusters (k = 3, 4, and 5). The guarantee SL clusters, but still its PA is greater than 99.8%
ground truth partitions of all 12 of these big datasets have for all 12 cases. k-means based schemes and CURE perform
the CS property; that is, they have a DI greater than 1. Each poorly for these types of clusters. The CPU time ratios we
group of four datasets having fixed size contains clusters drawn highlighted in Example 1 are almost identical to those seen
from 2-D Gaussian mixtures having various a priori probabil- in Table II. Inspecting the last row of Table II, we see that in
ities. Fig. 3 shows the siVAT images for the three datasets of terms of accuracy, the ranks of the five algorithms are, from
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

KUMAR et al.: A HYBRID APPROACH TO CLUSTERING IN BIG DATA 9

TABLE I
AVERAGE R ESULTS OF 25 RUNS FOR THE 12 B IG DATASETS OF CS G AUSSIAN C LUSTERS

TABLE II
AVERAGE R ESULTS OF 25 RUNS FOR THE 12 B IG DATASETS OF N ON -CS G AUSSIAN C LUSTERS

best to worst: clusiVAT, CURE, spkm, k-means, and okm. In


terms of CPU time for these 300 runs, from fastest to slow-
est, the ranks are: clusiVAT, k-means, okm, spkm, and CURE.
CURE is on average seven times slower than the other four
algorithms.

C. Example 3 (High-Dimensional Big Data Experiment)


This experiment compares the five algorithms on high-
dimensional big data having k = 100 Gaussian clusters with Fig. 4. Typical siVAT reordered distance matrix image for a n = 1000
p = 100, 200, and 500 features and N = 100 000, 200 000, sample points of k = 100 Gaussian clusters with p = 100 and N = 100 000.
and 500 000 datapoints. Fig. 4 shows a typical siVAT image
for a high-dimensional big dataset having N = 100 000 points D. Example 4 (Forest Cover Type Data Experiment)
and p = 100 dimensions each. The siVAT image indicates the This experiment uses the forest cover data [53], which
presence of 100 clusters by 100 dark blocks along the diagonal consists of 54 cartographic features obtained by the U.S.
(a serious zoom is needed to see them all!). Geological Survey and U.S. Forest Service USFS, for a total
The average result of 25 runs on these nine high- of N = 581 012 (30 m×30 m) cells. These data were then cat-
dimensional big datasets is given in Table III. DI of the ground egorized into seven forest types. This is a challenging dataset
truth partition for each of these datasets is less than 1, so these for any clustering algorithm as it contains ten continuous fea-
datasets are probably not CS. It is also evident from Table III tures, and 44 binary features (four wilderness types and 40 soil
that clusiVAT provides the best PA in the minimum time. We types). We normalized the continuous features to the interval
attribute the poor performance of CURE as compared to pre- [0, 1] so that they had the same scale as the binary features.
vious experiments to the large number of clusters (k = 100). The ground truth partition is not CS as DI for the ground truth
It is likely that with many clusters, the randomly drawn sam- labels is 0.006.
ples used by CURE do not adequately capture the geometry Table IV contains the average PA and CPU times of our
of the big data. five models on the forest data over 25 runs. As you can see,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

10 IEEE TRANSACTIONS ON CYBERNETICS

TABLE III
AVERAGE R ESULTS OF 25 RUNS FOR THE N INE H IGH -D IMENSIONAL B IG DATASETS H AVING 100 N ON -CS G AUSSIAN C LUSTERS

TABLE IV
AVERAGE R ESULTS OF 25 RUNS FOR F OREST C OVER T YPE DATASET

Fig. 5. siVAT reordered distance matrix image for a n = 70 sample points Fig. 6. siVAT reordered distance matrix image for a n = 230 sample points
of the Forest dataset. of the KDD-99 training dataset.

clusiVAT recovers the highest percentage (44.7%) of the Users to root attack starts out with access to a normal user
ground truth labels with a run time of 4.23 s, and CURE is a account on the system and is able to exploit some vulnerabil-
very close second, with PA = 43.6%, but at a time cost that ity to obtain root access to the system. It contains the following
is about 12 times higher than clusiVAT. The siVAT image in attacks: “buffer_overflow,” “loadmodule,” “rootkit,” and “perl.”
Fig. 5 of the forest data carries no suggestion that k = 7 is Remote to local attack sends packets to a machine to which
a good assessment of cluster tendency in the forest data. In the attacker does not have legitimate access, and exploits
fact, this image suggests k = 2 clusters at low resolution, and some vulnerability to gain local access as a user. It consists
within the larger cluster perhaps k = 15 smaller clusters; so, of the attacks: “warezclient,” “multihop,” “ftp_write,” “imap,”
we are not unhappy with these accuracies. “guess_passwd,” “warezmaster,” “spy,” and “phf.” Probing
attack attempts to gather information about a network of com-
puters for the apparent purpose of circumventing its security. It
E. Example 5 (KDD Cup 99 Data Experiment) contains the following attacks: “portsweep,” “satan,” “nmap,”
This data set was used for the Third International and “ipsweep.”
Knowledge Discovery and Data Mining Tools Competition. Fig. 6 shows a siVAT image for the KDD-99 training
The KDD-99 training dataset consists of 4 292 637 instances dataset. The middle big black block represents the “smurf”
of 41 dimensional vectors and is labeled data that specifies the attack comprising 60% of the total dataset. The top left cor-
attack type (normal or attack). We normalized the 41 features ner black block represents the “normal” case (approximately
to the interval [0, 1] so that they all had the same scale. The 18% of the total datapoints) and the bottom right corner block
ground truth partition is not CS as DI for the ground truth represents the “neptune” attack (approximately 20% of the
labels is 0. total datapoints). The remaining attack types are represented
KDD-99 has 22 simulated attack types, which fall in one by very small black subblocks along the diagonal.
of the following four categories [54]. Denial of service attack We have performed experiments to cluster different attack
makes some computing or memory resources too busy or too types and the average results of 25 runs are given in Table V.
full to handle legitimate requests. It consists of the attacks: clusiVAT performs well, having an average accuracy of 97.1%
“neptune,” “back,” “smurf,” “pod,” “land,” and “teardrop.” and a minimum average time of 76.0 s.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

KUMAR et al.: A HYBRID APPROACH TO CLUSTERING IN BIG DATA 11

TABLE V
AVERAGE R ESULTS OF 25 RUNS FOR KDD-99 T RAINING DATASET

TABLE VI
PIR, N OISE , AND L IGHT S ENSOR VALUE P LOTS FOR T HREE C LUSTERS AT N ODE 4

F. Example 6 (REDUCE Energy Data Experiment)


This energy data was collected from a real wireless sen-
sor network deployment at the Guildford campus in CCSR,
University of Surrey, U.K., as part of the IoT project [13]–[15].
It comprises energy usage data (of connected devices) and the
context of the users from an indoor office environment. There
are 127 nodes deployed in two floors, which measure total (a) (b)
power, reactive power, phase angle, root mean square (RMS)
voltage, RMS current, light, temperature, Passive infrared
(PIR, a motion sensor), noise (acoustic) level, and vibration
(total of ten features) at each desk. This is an example of a
real world unlabeled dataset.
We clustered the attributes collected at each node during the
period of April 1, 2012 to April 15, 2012 (sampling period of
10 s). We took the average value of all the sensor readings
for 15 min durations at each node to average the outliers. The (c) (d)
total number of samples for each node is 1440 [15 days ×
24 h × 4 (15 min intervals per hour)]. The siVAT images for Fig. 7. siVAT images of 300 sample points of four energy data nodes.
the four nodes shown in Fig. 7 indicates the presence of 3, 2, (a) Node 4. (b) Node 93. (c) Node 103. (d) Node 120.
4, and 2 clusters for nodes 4, 93, 103, and 120, respectively.
To illustrate that different clusters are indeed separated from by the PIR and noise sensor values, which indicate persistent
each other, a plot of the PIR, noise, and light sensor values human presence during daytime, infrequent presence during
at node 4 is plotted for three clusters obtained using clusiVAT evenings and almost no presence during night.
(Table VI). The average light intensity value for cluster 1 is Since these data are not labeled, we used DI as a measure
close to 500, indicating the presence of sunlight (daytime); for of accuracy for different algorithms. Experimental results for
cluster 2, it is close to 150, indicating lamp light (cloudy day or the data obtained from four nodes are listed in Table VII. The
evening times); and that of cluster 3 is much less (around 20) maximum DI and minimum time for each node are shown
indicating night time. These conclusions are also supported in bold. The highest DI value was achieved using clusiVAT,
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

12 IEEE TRANSACTIONS ON CYBERNETICS

TABLE VII
AVERAGE R ESULTS OF 25 RUNS FOR REDUCE E NERGY DATASET

DI DI DI DI DI

indicating that the clusiVAT partitions are superior with regard clusiVAT gives an accuracy of 100% in much less time than
to Dunn’s validity measure. k-means and its variants, and CURE. For 2-D non-CS datasets,
clusiVAT gives quite high accuracy (≥99.8%) in 12–18 times
G. Friedman Test less CPU time than k-means and its relatives, and 60–90
In this experiment, we apply Friedman test [37] to the results times less CPU time than CURE. To illustrate the utility of
obtained from all the A = 5 clustering algorithms on all the clusiVAT for unlabeled data, we performed experiments on
B = 39 datasets (described in Examples 1–6). For the datasets an energy dataset and demonstrated that clusiVAT produced
in Examples 1–5, we have used PA, and for Example 6 we clusters having a DI much greater than 1. Since the data are
have used DI as a measure to rank the algorithms. The algo- unlabeled, there is no way to assess PA for the energy data.
rithm giving highest PA/DI is assigned rank 1, the second The Friedman test performed on the PA and DI results for
highest is assigned rank 2, and so on. If two or more algo- different datasets validates the performance ranking of the var-
rithms have the same value of PA/DI, the average of their ranks ious clustering algorithms, the average ranks being 1.56, 4.18,
is assigned to each of them. For example, for the N = 100 000 2.17, 4.36, and 2.73 for clusiVAT, k-means, spkm, okm, and
and k = 3 CS big dataset in Example 1 [Section VIII-A], the CURE, respectively.
PA values for clusiVAT, k-means, spkm, okm, and CURE are k-means and its big data variants are sometimes plagued
100, 92.5, 100, 80, and 100, respectively. The highest PA of by initialization issues, which result in extremely poor per-
100 is achieved by clusiVAT, spkm, and CURE, so each of formance; this brings down their average accuracy. Of the
them is assigned a rank of 2 (average of 1–3). k-means is k-means algorithms, spkm seemed to perform the best because
assigned rank 4 and okm rank 5. of its accuracy on small datasets. CURE creates relatively good
The average ranks of the algorithms over all 39 datasets clusters for smallish CS and non-CS datasets, but takes much
are calculated to be 1.56, 4.18, 2.17, 4.36, and 2.73 for clusi- more time than clusiVAT. A significant advantage of clusiVAT,
VAT, k-means, spkm, okm, and CURE, respectively, giving as compared to CURE, is the siVAT image of the clusiVAT
the Friedman statistics of χF2 = 94.64 and FF = 106.52 sample, which provides us with a best guess for k, whereas
[using (10) and (11)]. With A = 5 algorithms and B = 39 CURE must be initialized with an uninformed user-supplied
datasets, χF2 is distributed according to a χF2 distribution with best guess. In summary, we think that the examples presented
A − 1 = 4 degrees of freedom and FF is distributed according here justify further study of clusiVAT for detecting substruc-
to a F distribution with A − 1 = 4 and (A − 1) × (B − 1) = 152 ture in big data. We are especially interested in testbeds for
degrees of freedom. The probability of the null hypothesis real world problems, so our next project is to move this model
(all the algorithms behave similarly and thus their ranks Rj and algorithm into the big data applications domain.
should be equal) computed using both, χF2 (4), and FF (4, 152)
is 0, so the null hypothesis is rejected. Hence, on the basis R EFERENCES
of the experiments performed in this paper, we can con- [1] A. Jain, M. Murty, and P. Flynn, “Data clustering: A review,”
clude that the ranking of clustering algorithms based on PA ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, Sep. 1999.
(DI for Example 6), from best to worst, clusiVAT, spkm, [2] D. Jiang, C. Tang, and A. Zhang, “Cluster analysis for gene expres-
CURE, k-means, and okm is consistent. sion data: A survey,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 11,
pp. 1370–1386, Nov. 2004.
[3] A. K. Jain, “Data clustering: 50 years beyond k-means,” in Machine
IX. C ONCLUSION Learning and Knowledge Discovery in Databases. Berlin, Germany:
Springer, 2008, pp. 3–4.
In this paper, we have illustrated our new clusiVAT algo- [4] J. Bezdek, Pattern Recognition With Objective Function Algorithms.
rithm for big datasets and have compared its performance to New York, NY, USA: Plenum, 1981.
four other popular clustering algorithms: 1) k-means; 2) spkm; [5] Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. T. Shen, “Multitask spec-
tral clustering by exploring intertask correlation,” IEEE Trans. Cybern.,
3) okm; and 4) CURE. vol. 45, no. 5, pp. 1069–1080, May 2015.
To show the usefulness of clusiVAT in terms of CPU [6] H. Zhu, C. Liu, Y. Ge, H. Xiong, and E. Chen, “Popularity modeling
time and PA, we performed experiments on 24 2-D syn- for mobile Apps: A sequential approach,” IEEE Trans. Cybern., vol. 45,
no. 7, pp. 1303–1314, Jul. 2015.
thetic datasets (having a maximum of 1 000 000 datapoints), [7] R. Sibson, “SLINK: An optimally efficient algorithm for the single-
nine high-dimensional synthetic datasets (having a maximum link cluster method,” Comput. J. (Brit. Comput. Soc.), vol. 16, no. 1,
of 500 000, 500 dimensional datapoints), and two real-life pp. 30–34, Jan. 1973.
[8] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “Internet of
big datasets (the largest of which has 4 292 637 vectors with Things (IoT): A vision, architectural elements, and future directions,”
41 features each). We found that for CS datasets our new Future Gener. Comput. Syst., vol. 29, no. 7, pp. 1645–1660, Sep. 2013.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

KUMAR et al.: A HYBRID APPROACH TO CLUSTERING IN BIG DATA 13

[9] A. Shilton, S. Rajasegarar, C. Leckie, and M. Palaniswami, “DP1SVM: A dynamic planar one-class support vector machine for Internet of Things environment,” in Proc. Int. Conf. Rec. Adv. Internet Things (RIoT), Singapore, Apr. 2015, pp. 1–6.
[10] J. Jin, J. Gubbi, S. Marusic, and M. Palaniswami, “An information framework for creating a smart city through Internet of Things,” IEEE Internet Things J., vol. 1, no. 2, pp. 112–121, Apr. 2014.
[11] Internet of Things (IoT) for Creating Smart Cities. [Online]. Available: http://issnip.unimelb.edu.au/research_program/sensor_networks/Internet_of_Things, accessed Jun. 5, 2015.
[12] D. Laney, 3D-Data Management: Controlling Data Volume, Velocity and Variety, Gartner, Stamford, CT, USA, 2001.
[13] L. Rashidi et al., “Profiling spatial and temporal behaviour in sensor networks: A case study in energy monitoring,” in Proc. IEEE 9th Int. Conf. Intell. Sensors Sensor Netw. Inf. Process. (ISSNIP), Singapore, Apr. 2014, pp. 1–7.
[14] M. Nati, A. Gluhak, H. Abangar, and W. Headley, “SmartCampus: A user-centric testbed for Internet of Things experimentation,” in Proc. 16th Int. Symp. Wireless Pers. Multimedia Commun. (WPMC), Atlantic City, NJ, USA, Jun. 2013, pp. 1–6.
[15] S. Rajasegarar et al., “Ellipsoidal neighbourhood outlier factor for distributed anomaly detection in resource constrained networks,” Pattern Recognit., vol. 47, no. 9, pp. 2867–2879, Sep. 2014.
[16] J. C. Bezdek and R. J. Hathaway, “VAT: A tool for visual assessment of (cluster) tendency,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Honolulu, HI, USA, May 2002, pp. 2225–2230.
[17] C. Lacey and S. Cole, “Merger rates in hierarchical models of galaxy formation. II: Comparison with N-body simulations,” Mon. Not. Roy. Astron. Soc., vol. 271, no. 3, pp. 676–692, Feb. 1994.
[18] M. Moshtaghi et al., “Clustering ellipses for anomaly detection,” Pattern Recognit., vol. 44, no. 1, pp. 55–69, Jan. 2011.
[19] P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy—The Principles and Practice of Numerical Classification. San Francisco, CA, USA: W. H. Freeman, 1973.
[20] A. Wilbik, J. M. Keller, and J. C. Bezdek, “Linguistic prototypes for data from eldercare residents,” IEEE Trans. Fuzzy Syst., vol. 22, no. 1, pp. 110–123, Mar. 2013.
[21] D. Zhang, K. Ramamohanarao, S. Versteeg, and R. Zhang, “RoleVAT: Visual assessment of practical need for role based access control,” in Proc. Conf. Comput. Security Appl., Honolulu, HI, USA, Dec. 2009, pp. 13–22.
[22] H. Steinhaus, “Sur la division des corps matériels en parties,” Bull. Acad. Polon. Sci., vol. 4, no. 12, pp. 801–804, 1956.
[23] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
[24] G. Ball and D. Hall, “ISODATA, a novel method of data analysis and pattern classification,” Stanford Res. Inst., Stanford, CA, USA, Tech. Rep. NTIS AD 699616, 1965.
[25] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp. Math. Stat. Probab., Berkeley, CA, USA, 1967, pp. 281–297.
[26] T. Havens, J. C. Bezdek, and M. Palaniswami, “Scalable single linkage clustering for big data,” in Proc. IEEE ISSNIP, Melbourne, VIC, Australia, Apr. 2013, pp. 396–401.
[27] D. Kumar et al., “clusiVAT: A mixed visual/numerical clustering algorithm for big data,” in Proc. IEEE Int. Conf. Big Data, Silicon Valley, CA, USA, Oct. 2013, pp. 112–117.
[28] M. Steinbach, G. Karypis, and V. Kumar, “A comparison of document clustering techniques,” in Proc. Workshop KDD, 2000.
[29] P. Bradley, U. Fayyad, and C. Reina, “Scaling clustering algorithms to large databases,” in Proc. 4th Int. Conf. Knowl. Disc. Data Mining, Menlo Park, CA, USA, 1998, pp. 9–15.
[30] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, “Fuzzy c-means algorithms for very large data,” IEEE Trans. Fuzzy Syst., vol. 20, no. 6, pp. 1130–1146, Dec. 2012.
[31] S. Eschrich, J. Ke, L. O. Hall, and D. B. Goldgof, “Fast accurate fuzzy clustering through data reduction,” IEEE Trans. Fuzzy Syst., vol. 11, no. 2, pp. 262–270, Apr. 2003.
[32] D. Feldman, M. Schmidt, and C. Sohler, “Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering,” in Proc. 24th Annu. ACM Symp. Discrete Algorithms, New Orleans, LA, USA, 2013, pp. 1434–1453.
[33] J. Cao, Z. Wu, J. Wu, and H. Xiong, “Sail: Summation-based incremental learning for information-theoretic text clustering,” IEEE Trans. Cybern., vol. 43, no. 2, pp. 570–584, Apr. 2013.
[34] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering algorithm for large databases,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, New York, NY, USA, Jun. 1998, pp. 73–84.
[35] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1248547.1248548
[36] S. García, A. Fernández, J. Luengo, and F. Herrera, “Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power,” Inf. Sci., vol. 180, no. 10, pp. 2044–2064, May 2010. [Online]. Available: http://dx.doi.org/10.1016/j.ins.2009.12.010
[37] M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis of variance,” J. Amer. Statist. Assoc., vol. 32, no. 200, pp. 675–701, Dec. 1937.
[38] R. A. Fisher, Statistical Methods and Scientific Inference. New York, NY, USA: Hafner, 1959.
[39] W. Petrie, “Sequences in prehistoric remains,” J. Anthropol. Inst. Great Britain Ireland, vol. 29, nos. 3–4, pp. 295–301, 1899.
[40] J. Czekanowski, “Zur Differentialdiagnose der Neandertalgruppe,” Korrespondenzblatt Deutsch. Ges. Anthropol. Ethnol. Urgesch., vol. 40, nos. 6–7, pp. 44–47, 1909.
[41] L. Wilkinson and M. Friendly, “The history of the cluster heat map,” Amer. Statist., vol. 63, no. 2, pp. 179–184, May 2009.
[42] L. Wang, X. Geng, J. Bezdek, C. Leckie, and K. Ramamohanarao, “Enhanced visual analysis for cluster tendency assessment and data partitioning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1401–1414, Oct. 2010.
[43] T. C. Havens and J. C. Bezdek, “An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm,” IEEE Trans. Knowl. Data Eng., vol. 24, no. 5, pp. 813–822, May 2012.
[44] R. J. Hathaway, J. C. Bezdek, and J. M. Huband, “Scalable visual assessment of cluster tendency for large data sets,” Pattern Recognit., vol. 39, no. 7, pp. 1315–1324, Jul. 2006.
[45] J. C. Dunn, “A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,” J. Cybern., vol. 3, no. 3, pp. 32–57, 1973.
[46] T. C. Havens, J. C. Bezdek, J. M. Keller, M. Popescu, and J. M. Huband, “Is VAT really single linkage in disguise?” Ann. Math. Artif. Intell., vol. 55, nos. 3–4, pp. 237–251, Apr. 2009.
[47] D. Aloise, A. Deshpande, P. Hansen, and P. Popat, “NP-hardness of Euclidean sum-of-squares clustering,” Mach. Learn., vol. 75, no. 2, pp. 245–248, Jan. 2009. [Online]. Available: http://dx.doi.org/10.1007/s10994-009-5103-0
[48] P. Hore, L. Hall, and D. Goldgof, “Single pass fuzzy C means,” in Proc. IEEE Int. Fuzzy Syst. Conf., London, U.K., Jul. 2007, pp. 1–7.
[49] P. Hore et al., “A scalable framework for segmenting magnetic resonance images,” J. Signal Process. Syst., vol. 54, nos. 1–3, pp. 183–203, Jan. 2009.
[50] H. Samet, Spatial Data Structures. Reading, MA, USA: Addison-Wesley, 1995.
[51] T. H. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA, USA: MIT Press, 2001.
[52] R. L. Iman and J. M. Davenport, “Approximations of the critical region of the Friedman statistic,” Commun. Statist., vol. 9, pp. 571–595, Jan. 1980.
[53] J. A. Blackard and D. J. Dean, “Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables,” Comput. Electron. Agri., vol. 24, no. 3, pp. 131–151, 2000.
[54] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A detailed analysis of the KDD’99 CUP data set,” in Proc. 2nd IEEE Symp. Comput. Intell. Security Defense Appl. (CISDA), Ottawa, ON, Canada, 2009, pp. 1–6.

Dheeraj Kumar received the B.Tech. and M.Tech. dual degrees in electrical engineering from the Indian Institute of Technology Kanpur, Kanpur, India, in 2010. He is currently pursuing the Ph.D. degree with the Department of Electrical and Electronic Engineering, University of Melbourne, Melbourne, VIC, Australia.
His current research interests include big data clustering, incremental clustering, spatio-temporal estimations, Internet of Things, machine learning, pattern recognition, and signal processing.
James C. Bezdek (LF’10) received the Ph.D. degree in applied mathematics from Cornell University, Ithaca, NY, USA, in 1973.
His current research interests include optimization, pattern recognition, clustering in very large data, co-clustering, visual clustering, and cluster validity.
Prof. Bezdek was a recipient of the IEEE 3rd Millennium, the IEEE Computational Intelligence Society Fuzzy Systems Pioneer, the IEEE Technical Field Award Rosenblatt, and the Kampé de Fériet medals. He is the Past President of the North American Fuzzy Information Processing Society, the International Fuzzy Systems Association (IFSA), and the IEEE Computational Intelligence Society, the Founding Editor of the International Journal of Approximate Reasoning and the IEEE TRANSACTIONS ON FUZZY SYSTEMS, and a Life Fellow of the IFSA.

Sutharshan Rajasegarar received the B.Sc. Engineering degree in electronic and telecommunication engineering (First Class Hons.) from the University of Moratuwa, Moratuwa, Sri Lanka, in 2002, and the Ph.D. degree from the University of Melbourne, Melbourne, VIC, Australia, in 2009.
He is currently a Research Fellow with the Department of Electrical and Electronic Engineering, University of Melbourne. His current research interests include wireless sensor networks, anomaly/outlier detection, spatio-temporal estimations, Internet of Things, machine learning, pattern recognition, signal processing, and wireless communication.
Marimuthu Palaniswami (F’12) received the M.E. degree in electrical, electronic and control engineering from the Indian Institute of Science, Bengaluru, India, the M.Eng.Sc. degree in electrical, electronic and control engineering from the University of Melbourne, Melbourne, VIC, Australia, and the Ph.D. degree in electrical, electronic and control engineering from the University of Newcastle, Callaghan, NSW, Australia.
He is currently a Professor with the University of Melbourne. He represents Australia as a core partner in EU FP7 projects such as SENSEI, SmartSantander, the Internet of Things Initiative, and SocIoTal. He has been funded by several Australian Research Council (ARC) and industry grants (over 40 million) to conduct research in the sensor network, Internet of Things (IoT), health, environmental, machine learning, and control areas. He has published over 400 refereed research papers and leads one of the largest funded ARC research networks, the Research Network on Intelligent Sensors, Sensor Networks and Information Processing Programme. His current research interests include SVMs, sensors and sensor networks, IoT, machine learning, neural networks, pattern recognition, and signal processing and control.
Prof. Palaniswami was a Panel Member for NSF, an Advisory Board Member for the European FP6 Grant Centre, a Steering Committee Member for the National Collaborative Research Infrastructure Strategy Great Barrier Reef Ocean Observing System and Smart Environmental Assessment and Monitoring Technologies, and a Board Member for Information Technology and Supervisory Control and Data Acquisition companies.

Christopher Leckie received the B.Sc. and B.E. degrees in electrical and computer systems engineering (First Class Hons.) and the Ph.D. degree in computer science from Monash University, Melbourne, VIC, Australia, in 1985, 1987, and 1992, respectively.
He joined Telstra Research Laboratories, Melbourne, VIC, Australia, in 1988, where he conducted research and development into artificial intelligence techniques for various telecommunication applications. In 2000, he joined the University of Melbourne, Melbourne, VIC, Australia, where he is currently a Professor with the Department of Computing and Information Systems. His current research interests include scalable data mining, network intrusion detection, bioinformatics, and wireless sensor networks.

Timothy Craig Havens received the Ph.D. degree in electrical and computer engineering from the University of Missouri, Columbia, MO, USA, in 2010.
He is currently the William and Gloria Jackson Assistant Professor with the Department of Electrical and Computer Engineering and the Department of Computer Science, Michigan Technological University, Houghton, MI, USA.
Dr. Havens was a recipient of the Best Paper Award at the 2012 IEEE International Conference on Fuzzy Systems, the IEEE Franklin V. Taylor Award for Best Paper at the 2011 IEEE Conference on Systems, Man, and Cybernetics, and the Best Student Paper Award from the Western Journal of Nursing Research in 2009. He is an Associate Editor of the IEEE TRANSACTIONS ON FUZZY SYSTEMS.