
Information Sciences 267 (2014) 16–34


Probabilistic cluster structure ensemble


Zhiwen Yu a,*, Le Li a, Hau-San Wong b, Jane You c, Guoqiang Han a, Yunjun Gao d, Guoxian Yu e

a School of Computer Science and Engineering, South China University of Technology, China
b Department of Computer Science, City University of Hong Kong, Hong Kong
c Department of Computing, Hong Kong Polytechnic University, Hong Kong
d College of Computer Science, Zhejiang University, China
e College of Computer and Information Science, Southwest University, Chongqing, China

Article info

Article history:
Received 7 April 2013
Received in revised form 12 November 2013
Accepted 12 January 2014
Available online 24 January 2014

Keywords:
Cluster ensemble
Gaussian mixture model
Normalized cut
Structure ensemble

Abstract

Cluster structure ensemble focuses on integrating multiple cluster structures extracted from different datasets into a unified cluster structure, instead of aligning the individual labels from the clustering solutions derived from multiple homogenous datasets in the cluster ensemble framework. In this article, we design a novel probabilistic cluster structure ensemble framework, referred to as the Gaussian mixture model based cluster structure ensemble framework (GMMSE), to identify the most representative cluster structure from the dataset. Specifically, GMMSE first applies the bagging approach to produce a set of variant datasets. Then, a set of Gaussian mixture models are used to capture the underlying cluster structures of the datasets. GMMSE applies K-means to initialize the values of the parameters of the Gaussian mixture model, and adopts the Expectation Maximization (EM) approach to estimate the parameter values of the model. Next, the components of the Gaussian mixture models are viewed as new data samples which are used to construct the representative matrix capturing the relationships among components. The similarity between two components corresponding to their respective Gaussian distributions is measured by the Bhattacharyya distance function. Afterwards, GMMSE constructs a graph based on the new data samples and the representative matrix, and searches for the most representative cluster structure. Finally, we also design four criteria to assign the data samples to their corresponding clusters based on the unified cluster structure. The experimental results show that (i) GMMSE works well on synthetic datasets and real datasets in the UCI machine learning repository, and (ii) GMMSE outperforms most of the previous cluster ensemble approaches.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

With the development of cluster ensemble techniques, a growing number of related approaches have been successfully
applied to different fields [32,33,11,64,39,55], such as medicine, bioinformatics, and multimedia data mining. For example,
Iam-On et al. [32] proposed a link-based cluster ensemble approach based on the similarity between clusters, and success-
fully applied it to both artificial and real datasets. They also applied this approach to solve the categorical data clustering
problem [33]. Christou [11] explored the optimization-based cluster ensemble approach which is formulated in terms of
intra-cluster criteria, and applied it to the TSPLIB benchmark data sets. Yu et al. [64] studied the knowledge based cluster
ensemble approach which is applied to perform cancer discovery from gene expression profiles. Mimaroglu and Aksehirli

* Corresponding author. Tel.: +86 20 62893506; fax: +86 20 39380288.
E-mail address: zhwyu@scut.edu.cn (Z. Yu).

http://dx.doi.org/10.1016/j.ins.2014.01.030

[39] designed a divisive clustering ensemble approach called DICLENS, which is able to identify the cluster number automat-
ically and achieved good performance on gene expression data sets. Weber et al. [55] gave a general definition of optimal
clustering related to overlapping clustering solutions, which is useful for cluster ensemble approaches.
Compared with traditional clustering algorithms, cluster ensemble approaches represent a more effective technique since
they have the ability to generate a unified clustering solution from multiple clustering solutions in the ensemble, and im-
prove the effectiveness, stability and robustness of the clustering process. Most of the previous cluster ensemble approaches
focus on the alignment of the labels of data samples derived from diverse clustering solutions, and do not take into account
the fusion of multiple cluster structures obtained from various data sources into a unified structure. The cluster structure
which summarizes information about the distribution of the data samples is more useful in a lot of scenarios. For example,
as time passes, some data sources will gradually change, which will lead to the variation of the labels of data samples in dif-
ferent clustering solutions. In this scenario, the cluster structure of the data is more important than the labels of data sam-
ples. This raises an interesting question of how to construct a cluster structure ensemble, and identify the most
representative cluster structure among the datasets.
There are a lot of useful applications for a cluster structure ensemble approach. For example, multiple sensors will gen-
erate a lot of datasets which have their own cluster structures in the area of mobile Internet. At the same time, the cluster
structures of these datasets share a large number of similar characteristics. How to construct a unified cluster structure
which captures the similarity of the cluster structures in different datasets generated from multiple sensors is an interesting
problem deserving intensive exploration. For another example, the objective of clustering analysis on lung cancer datasets is
to assign samples to their corresponding classes. The lung adenocarcinoma dataset in [7] contains 203 samples assigned to
5 classes: adenocarcinoma, small-cell lung cancer, pulmonary carcinoids, squamous cell lung carcinomas, and normal lung
tissues. Since there are a large number of datasets obtained by different research groups in the area of lung cancer research
[16], it raises the question of how to find the most representative cluster structure from the cluster structures obtained from
the different datasets.
In this paper, we design a new probabilistic cluster structure ensemble framework, referred to as the Gaussian mixture
model based cluster structure ensemble framework (GMMSE), to identify the most representative cluster structure from the
dataset. Specifically, GMMSE first integrates the bagging technique, the K-means algorithm and the Expectation–Maximiza-
tion approach to generate diversity, and estimate the various cluster structures from different data sources. Then, it adopts
the normalized cut algorithm [47] and the representative matrix constructed based on the set of cluster structures from dif-
ferent data sources to find the most representative cluster structure. Finally, GMMSE applies four assignment criteria, which
are the nearest Gaussian model criterion (NGM), the average Gaussian model criterion (AGM), the nearest group center cri-
terion (NGC) and the Gaussian model based majority voting criterion (GMV), to assign the data samples to their correspond-
ing clusters based on this most representative cluster structure. The results in the experiment show that GMMSE achieves
good performance on both synthetic datasets and real datasets in the UCI machine learning repository.
The contribution of the paper is fourfold. First, we propose a Gaussian mixture model based cluster structure ensemble framework (GMMSE) to identify the most representative cluster structure. Second, four criteria are designed to assign data samples to their corresponding clusters based on this representative cluster structure. Third, the time and space complexity of GMMSE are analyzed. Fourth, the representative matrix is designed to capture the relationships among the components of the Gaussian mixture models, and the Bhattacharyya distance function is adopted to measure the similarity between two components with respect to their respective Gaussian distributions.
The remainder of the paper is organized as follows. Section 2 introduces the related works to cluster ensemble ap-
proaches. Section 3 describes the Gaussian mixture model based cluster structure ensemble framework, and analyzes the
time and space complexity of GMMSE. Section 4 evaluates the performance of GMMSE through experiments on synthetic
datasets, as well as several real datasets in the UCI machine learning repository. Section 5 draws conclusions and describes
possible future works.

2. Related works

Recently, ensemble learning has been gaining more and more attention, since these approaches have the ability to provide more accurate, stable and robust final results when compared with traditional approaches. Most of the ensemble learn-
ing approaches [45,24,43] can be categorized into three types, which are supervised learning ensemble, semi-supervised
learning ensemble and unsupervised learning ensemble. Supervised learning ensemble, also called classifier ensemble, in-
cludes a number of popular approaches such as bagging [9], boosting [20], random forest [10], random subspace [27,23],
rotation forest [44], ensemble based on random linear Oracle [36], and ensemble based on neural networks [71,70,56]. Ada-
boost is an example of forming the ensemble during the learning process: it adjusts the weights of training data samples in the learning process, and integrates multiple weak classifiers into a strong one. In other words, the main focus of Adaboost is to adjust the weights of a sequence of classifiers during learning, so that each classifier in the sequence affects the training of the subsequent one. On the other hand, bagging is an example of forming an ensemble at the output stage, which integrates multiple learning results into a more representative result by a suitable voting scheme. Semi-supervised learning ensemble includes the multi-view learning approach [21] and the consensus maximization approach [22].

We mainly focus on the third type. Unsupervised learning ensemble, also called cluster ensemble or consensus clustering,
mainly consists of two stages: ensemble generation and consensus combination of solutions. A cluster ensemble approach
will generate a set of clustering solutions which are as diverse as possible in the first stage, and then a suitable consensus
function will be selected to combine different clustering solutions into a unified one in the second stage. The existing cluster
ensemble approaches can be roughly classified into three categories.
The cluster ensemble approaches in the first category focus on how to design a new cluster ensemble approach. For exam-
ple, Strehl and Ghosh [48] first proposed the cluster ensemble framework based on three kinds of ensemble techniques to
improve the final result of single clustering algorithms. Fern and Brodley [17] designed two cluster ensemble methods which
are based on the random projection technique and the principal component analysis technique. Topchy et al. [51] focused on
how to find a unified representation in the ensemble using the EM algorithm. Fred and Jain [19] proposed the evidence
accumulation based cluster ensemble approach. Sevillano et al. [46] designed a soft cluster ensemble approach based on
the Borda voting scheme. Ayad and Kamel [3] used a probabilistic mapping and proposed the cumulative voting based clus-
ter ensemble method to search for the most representative partition. They [4] also formulated the voting problem as a
regression problem, and proposed a voting based cluster ensemble framework which integrates the bipartite matching tech-
nique, the information theoretic approach and the cumulative voting technique. Wang et al. [54] introduced Bayesian theory
into the cluster ensemble framework, and designed the Bayesian cluster ensemble approach. Domeniconi and Al-Razgan [15]
considered the weighted clusters obtained from different subspaces, and proposed a weighted cluster ensemble approach.
Zheng et al. [69] designed a hierarchical ensemble clustering framework which took into account the ultra-metric distance
during the process of hierarchical clustering. Recently, Pedrycz [42] proposed a new ensemble framework, referred as the
collaborative clustering approach, which considers the clustering process at both the level of data and communication be-
tween information granules. More researches about cluster ensemble approaches in the first category could be found in
[38,49,50,40].
The cluster ensemble approaches in the second category explore the properties of cluster ensemble approaches. For
example, Kuncheva and Whitaker [37] and Kuncheva and Hadjitodorov [35] investigated the relation between the diversity
and accuracy of an ensemble, and how to improve the ensemble accuracy with suitable diversity. They [34] also studied the
stability of K-means based cluster ensembles with respect to random initializations. Topchy et al. [52] focused more on the
convergence property of cluster ensemble approaches. Hadjitodorov et al. [26] investigated the diversity of cluster ensemble
approaches based on their proposed diversity measure, and concluded that moderate diversity is a suitable condition for
improving the performance of cluster ensemble techniques. Fern and Lin [18] explored how to select a subset of solutions
in the ensemble based on the quality and diversity. Amasyali and Ersoy [1] studied the effect of different factors on the per-
formance of cluster ensembles. Azimi and Fern [5] studied the effect of ensemble diversity, and proposed an adaptive cluster
ensemble selection approach. Hore et al. [28] proposed a framework to investigate the scalability of cluster ensemble ap-
proaches. Wang [53] investigated the efficiency of cluster ensemble approaches, and designed a hierarchical data cluster
structure based CA-tree to speed up the process of ensemble formation. Zhang and Wong [67] explored the effectiveness
of cluster ensembles, and proposed a generalized Adjusted Rand Index to measure the consistency between two partitions
of a data set.
The cluster ensemble approaches in the third category mainly focus on how to apply these approaches to different areas.
For example, Monti et al. [41] proposed a resampling based cluster ensemble approach, and applied it to explore the under-
lying cluster structure of gene expression data. Yu et al. [61] presented a graph-based clustering ensemble method, and uti-
lized it to detect the fundamental cluster structure of gene expression data. They [63] also proposed a new cluster ensemble
framework which employed the perturbation technique, the neural gas approach and a new cluster validity index to perform
class discovery from cancer microarray data. Greene et al. [25] applied an ensemble approach based on non-negative matrix
factorization to large-scale protein and protein interaction data analysis. Hu et al. [29] studied how to identify gene clusters
from gene expression datasets by cluster ensemble. Yang and Chen [57] applied a weighted cluster ensemble approach based
on different temporal data representations to perform clustering on temporal data. Ye et al. [58] incorporated domain knowl-
edge into the cluster ensemble framework, and applied it to automatic malware categorization. Iam-on et al. [31] presented a
linkage based cluster ensemble approach to perform microarray data analysis. Bassiou et al. [6] designed a speaker diariza-
tion system based on cluster ensemble approaches. Zhang et al. [68] explored how to perform data mining on data streams
based on the cluster ensemble approach. Zhang et al. [66] proposed a spectral cluster ensemble approach, and applied it to
SAR image segmentation. More studies about these approaches could be found in [68,59,60,62].
The majority of the existing researches on cluster ensemble only focus on how to integrate diverse clustering solutions,
i.e. the alignment of diverse cluster labels of samples, while overlook the important role of cluster structures in ensemble
formation, and leave out the problem of how to combine the respective cluster structures of multiple datasets in the ensem-
ble into a unified cluster structure. In this paper, we propose a Gaussian mixture model based cluster structure ensemble
framework (GMMSE) to address this issue. The generalized concept of the cluster structure ensemble framework is first pro-
posed by Yu et al. [65], which is used to find the most representative cluster structure from multiple cluster structures ob-
tained from different sets of data samples. They incorporated the re-sampling technique, the force based self-organizing map
and graph theory into a probabilistic bagging based cluster structure ensemble to search for a unified cluster structure. The
probabilistic bagging based cluster structure ensemble framework achieved good performance on both datasets from the UCI
repository and real cancer gene expression profiles. Compared with the above cluster structure ensemble approach, the pro-
posed cluster structure ensemble GMMSE is characterized by several features: (1) GMMSE uses the Gaussian mixture model

to capture the underlying cluster structure of the dataset, instead of using weight vectors of the neurons in a self-organizing
map (SOM) to approximate the structure. (2) It adopts the representative matrix to capture the relationships among the
components of Gaussian mixture models, and uses the Bhattacharyya distance function to measure the similarity between
two components in the Gaussian mixture model. (3) GMMSE adopts four assignment criteria to distribute data samples
to their corresponding clusters based on the unified structure.

3. Gaussian mixture model based cluster structure ensemble

Fig. 1 provides an overview of the Gaussian mixture model based cluster structure ensemble framework (GMMSE). Specifically, GMMSE applies a re-sampling technique, for example the bagging technique, to generate a set of new datasets $F_1, F_2, \ldots, F_B$ from the original dataset $F$ in the first step. In the second step, the underlying cluster structure of each new dataset $F_b$ ($b \in \{1, \ldots, B\}$) is captured by the Gaussian mixture model $H_b$. GMMSE adopts the K-means algorithm to initialize the values of the parameters of $H_b$. In the third step, a set of models $H_1, H_2, \ldots, H_B$ are obtained by GMMSE based on the Expectation Maximization approach. In the fourth step, GMMSE views each component of $H_b$ as a new data sample, adopts the Bhattacharyya distance to calculate the similarity between a pair of new data samples corresponding to two Gaussian distributions, and constructs an adjacency matrix and an attraction matrix. The representative matrix is further constructed based on these matrices. In the fifth step, a suitable cluster structure consensus function is selected to partition the representative matrix and obtain the most representative model $H$. In the sixth step, GMMSE uses the new model $H$ to group the data samples in the original dataset $F$ into several clusters according to different assignment criteria to obtain the final result.
Given a dataset $F$ with $n$ data samples ($F = \{f_1, f_2, \ldots, f_n\}$, where $n$ is the number of data samples), GMMSE selects a subset of data samples by the bagging technique in the first step. Specifically, a constant $n_b$, the number of data samples in the subset, is randomly generated as follows:

$$n_b = n_{\min} + \lfloor r_1 (n_{\max} - n_{\min}) \rfloor \qquad (1)$$

where $r_1$ ($r_1 \in [0,1]$) is a uniform random variable. $n_{\min}$ and $n_{\max}$, which are pre-specified parameters, control the number of samples in the subset. GMMSE selects $n_b$ data samples successively. The subscripts of the chosen samples are defined as follows:

$$i = \lfloor 1 + r_2 n \rfloor \qquad (2)$$

where $i$ denotes the $i$th data sample in the dataset $F$, and $r_2 \in [0,1)$ is a uniform random variable. The selected $n_b$ samples are applied to construct a new dataset $F_b$. Similarly, GMMSE produces a series of new datasets $F_1, F_2, \ldots, F_B$ from the original dataset $F$. The time complexity of the bagging technique is $O(n)$, while the space complexity of this technique is also $O(n)$ (the time and space complexity of an algorithm provide an estimate of how the computation time and memory increase as a function of the input data sample size $n$ [12]).
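A minimal sketch of this sampling step is shown below (Python/NumPy). The function and parameter names (`bag_datasets`, `n_min`, `n_max`, `B`) are our own and simply mirror $n_{\min}$, $n_{\max}$ and $B$ above; the default of sampling without replacement follows the setting reported in Section 4.2, while `replace=True` matches a literal reading of Eq. (2).

```python
import numpy as np

def bag_datasets(F, n_min, n_max, B, replace=False, rng=None):
    """Generate B bagged subsets of the rows of F, following Eqs. (1)-(2).

    Section 4.2 reports bagging without replacement, hence the default
    replace=False; replace=True matches a literal reading of Eq. (2)."""
    rng = np.random.default_rng(rng)
    F = np.asarray(F)
    n = F.shape[0]
    subsets = []
    for _ in range(B):
        # Eq. (1): random subset size n_b drawn between n_min and n_max.
        n_b = n_min + int(rng.uniform() * (n_max - n_min))
        # Eq. (2): indices drawn uniformly from the original dataset.
        idx = rng.choice(n, size=n_b, replace=replace)
        subsets.append(F[idx])
    return subsets

# Example: 20 subsets of a toy dataset, each holding 50%-80% of the samples.
F = np.random.default_rng(0).normal(size=(200, 5))
bags = bag_datasets(F, n_min=100, n_max=160, B=20, rng=1)
```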
In the second step, GMMSE uses the Gaussian mixture model $H_b$ to capture the underlying cluster structure of $F_b$ (where $b \in \{1, \ldots, B\}$). The K-means algorithm is adopted to initialize the values of the parameters of the Gaussian mixture model $H_b$, which is useful for reducing the computational cost. The time complexity of the K-means algorithm is $O(n)$, and the space complexity is also $O(n)$.

Fig. 1. Gaussian mixture model based cluster structure ensemble.
Table 1 summarizes the parameters of the GMMSE approach. GMMSE initializes the parameter values of the mixture model in the first iteration according to the approximate clusters obtained through K-means, as follows:

$$\pi_j^{(1)} = \frac{|c_j|}{\sum_{j=1}^{K_b} |c_j|} \qquad (3)$$

$$\mu_j^{(1)} = \frac{\sum_{f \in c_j} f}{|c_j|} \qquad (4)$$

$$\mathrm{Cov}_j(h, l) = \frac{\sum_{f_i \in c_j} (f_{i,h} - \bar{f}_h)(f_{i,l} - \bar{f}_l)}{|c_j|} \qquad (5)$$
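As a rough illustration, the initialization in Eqs. (3)-(5) can be computed from a K-means partition as in the sketch below (assuming scikit-learn's `KMeans`; the helper name `init_gmm_from_kmeans` is ours, not the authors' implementation).

```python
import numpy as np
from sklearn.cluster import KMeans

def init_gmm_from_kmeans(F_b, K_b, seed=0):
    """Initialize mixing proportions, means and covariances from a K-means
    partition of the bagged dataset F_b, following Eqs. (3)-(5)."""
    F_b = np.asarray(F_b, dtype=float)
    labels = KMeans(n_clusters=K_b, n_init=10, random_state=seed).fit_predict(F_b)
    n_b = F_b.shape[0]
    pis, mus, covs = [], [], []
    for j in range(K_b):
        C_j = F_b[labels == j]
        pis.append(len(C_j) / n_b)              # Eq. (3): mixing proportion
        mu_j = C_j.mean(axis=0)
        mus.append(mu_j)                        # Eq. (4): cluster mean
        diff = C_j - mu_j
        covs.append(diff.T @ diff / len(C_j))   # Eq. (5): cluster covariance
    return np.array(pis), np.array(mus), np.array(covs)
```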
GMMSE adopts the Expectation Maximization approach [14] to estimate the parameter values of the Gaussian mixture model $H_b$ in the third step. During the expectation step, the probability that the data sample $f_i$ belongs to the cluster $c_j$ in the $t$th iteration is computed as follows:

$$p^{(t)}(c_j \mid f_i) = \frac{p(f_i \mid c_j)}{\sum_{l=1}^{K_b} p(f_i \mid c_l)} \qquad (6)$$

$$p^{(t)}(f_i \mid c_j) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_j|}} \, e^{-\frac{1}{2}(f_i - \mu_j)^T \Sigma_j^{-1} (f_i - \mu_j)} \qquad (7)$$

We then calculate the following sum:

$$s_j^{(t)} = \sum_{f_i \in F_b} \frac{p(f_i \mid c_j)}{\sum_{l=1}^{K_b} p(f_i \mid c_l)} \qquad (8)$$

During the maximization step, the parameter values of the mixture model are updated as follows:

$$\pi_j^{(t+1)} = \frac{s_j^{(t)}}{n_b} \qquad (9)$$

$$\mu_j^{(t+1)} = \frac{1}{s_j^{(t)}} \sum_{i=1}^{n_b} f_i \cdot p^{(t)}(c_j \mid f_i) \qquad (10)$$

$$\Sigma_j^{(t+1)} = \frac{1}{s_j^{(t)}} \sum_{i=1}^{n_b} p^{(t)}(c_j \mid f_i)\, (f_i - \mu_j^{(t+1)})(f_i - \mu_j^{(t+1)})^T \qquad (11)$$

EM alternately performs the expectation step and the maximization step until the change of the log likelihood of the mixture model, $|L_{t+1}(H_b \mid F_b) - L_t(H_b \mid F_b)|$, becomes smaller than a threshold $\epsilon$, where

$$L(H_b \mid F_b) = \sum_{f_i \in F_b} \log\left( \sum_{j=1}^{K_b} \pi_j \cdot p(f_i \mid c_j) \right) \qquad (12)$$

Table 1
The parameters of the EM algorithm.

Parameter               Meaning
$f_i$                   The $i$th data sample
$|c_j|$                 Cardinality of the cluster $c_j$
$K_b$                   Number of clusters in the initial step
$\pi_j$                 Mixing proportion of the cluster $c_j$
$\mu_j$                 Mean vector of the cluster $c_j$
$\Sigma_j$              Covariance matrix of the cluster $c_j$
$\mathrm{Cov}_j(h,l)$   Covariance between the $h$th and $l$th dimensions of the cluster $c_j$
$\bar{f}_h$             Average value of the data points in the $h$th dimension
$\theta$                Parameters to be estimated
$t$                     The $t$th iteration
$n_b$                   Cardinality of the dataset $F_b$

GMMSE adopts the EM algorithm to obtain a set of mixture models $H_1, H_2, \ldots, H_B$. The computational cost of the EM algorithm is $O(T_2 K m^2 n)$ (where $T_2$ denotes the number of iterations, $K$ denotes the number of clusters, $m$ denotes the number of attributes and $n$ denotes the number of feature vectors). Since $T_2 \ll n$ and $K \ll n$, the time complexity of the EM algorithm is $O(m^2 n)$, and the space complexity is also $O(m^2 n)$.
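The update rules in Eqs. (6)-(12) translate almost directly into code. The sketch below is a compact, unoptimized EM loop following those equations (NumPy/SciPy); the small `reg` term added to each covariance when evaluating the densities is our own numerical safeguard and not part of the formulation above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(F_b, pis, mus, covs, max_iter=100, tol=1e-4, reg=1e-6):
    """Fit one Gaussian mixture by EM (Eqs. (6)-(12)), starting from the
    K-means initialization of Eqs. (3)-(5)."""
    F_b = np.asarray(F_b, dtype=float)
    pis, mus, covs = np.asarray(pis, float), np.asarray(mus, float), np.asarray(covs, float)
    n_b, d = F_b.shape
    K_b = len(pis)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # Component densities p(f_i | c_j), Eq. (7); reg keeps covariances invertible.
        dens = np.column_stack([
            multivariate_normal.pdf(F_b, mean=mus[j], cov=covs[j] + reg * np.eye(d))
            for j in range(K_b)])
        # Log likelihood of the current parameters, Eq. (12), and the
        # stopping rule |L_{t+1} - L_t| < epsilon.
        ll = np.log((dens * pis).sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
        # E-step: responsibilities as written in Eq. (6).
        resp = dens / dens.sum(axis=1, keepdims=True)
        s = resp.sum(axis=0)                                   # Eq. (8)
        # M-step: Eqs. (9)-(11).
        pis = s / n_b
        mus = (resp.T @ F_b) / s[:, None]
        covs = np.stack([
            (resp[:, j, None] * (F_b - mus[j])).T @ (F_b - mus[j]) / s[j]
            for j in range(K_b)])
    return pis, mus, covs
```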
In the fourth step, GMMSE views each component of $H_b$ as a new data sample, and creates a new dataset $C = \{c_1, c_2, \ldots, c_{n'}\}$ (where $n' = K_1 + K_2 + \cdots + K_B$, and $K_b$ is the number of components in the Gaussian mixture model $H_b$). The similarity $w(c_i, c_j)$ ($i, j \in \{1, \ldots, n'\}$) between two components $c_i$ and $c_j$ is measured by the Bhattacharyya distance [8], which is defined as follows:

$$w(c_i, c_j) = (\mu_i - \mu_j)^T \left[ \frac{\Sigma_i + \Sigma_j}{2} \right]^{-1} (\mu_i - \mu_j) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_i + \Sigma_j}{2} \right|}{\sqrt{|\Sigma_i|\,|\Sigma_j|}} \qquad (13)$$

where $\mu_i$, $\mu_j$, $\Sigma_i$, $\Sigma_j$ are the means and covariance matrices of the components $c_i$ and $c_j$. GMMSE constructs an adjacency matrix $W$ whose entries are defined as follows:

$$w_{ij} = -w(c_i, c_j) \qquad (14)$$

The adjacency matrix focuses on the relationship between any pair of Gaussian distributions $c_i$ and $c_j$. If the Gaussian distribution $c_i$ is close to $c_j$, $w_{ij}$ has a large value. If they are significantly different, the value of $w_{ij}$ will be small.
GMMSE also takes into consideration the relationship between the Gaussian distribution $c_i$ and its $K'$ nearest neighbor distributions (where $K'$ is a constant pre-specified by the user). The attraction matrix $M$ is generated according to the relationship between the Gaussian distribution $c_i$ and its $K'$ nearest neighbor distributions, and its entry $m_{ij}$ is defined as follows:

$$m_{ij} = -\frac{\zeta(c_i) + \zeta(c_j)}{2} \qquad (15)$$

$$\zeta(c_i) = \frac{1}{K'} \sum_{k=1}^{K'} w(c_i, c^*_k) \qquad (16)$$

$$\{c^*_1, \ldots, c^*_{K'}\} = \arg\min_{\{c_1, \ldots, c_{K'}\} \subset C} \sum_{k=1}^{K'} w(c_i, c_k) \qquad (17)$$

$\zeta(c_j)$ can be calculated in the same way as $\zeta(c_i)$. If the $K'$ nearest neighbors are close to $c_i$, $\zeta(c_i)$ has a small value. Otherwise, the value of $\zeta(c_i)$ is large. Attractive distribution pairs, i.e. pairs for which both $\zeta(c_i)$ and $\zeta(c_j)$ have small values, lead to a large $m_{ij}$ value in the attraction matrix $M$.
GMMSE utilizes the properties of the adjacency matrix as well as the attraction matrix to construct a representative matrix $A$, whose entries $a_{ij}$ are determined as follows:

$$a_{ij} = \omega_1 w_{ij} + \omega_2 m_{ij} \qquad (18)$$

where $\omega_1$ and $\omega_2$ are user-specified parameters. The computational cost of constructing the representative matrix consists of two parts: the time for constructing the adjacency matrix, $O(n'^2)$, and the time for constructing the attraction matrix, $O(n'^2 \log n')$. Since $n'^2 < n'^2 \log n'$, the time complexity of constructing the representative matrix is $O(n'^2 \log n')$, and the space complexity is $O(n'^2)$.
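The construction of $W$, $M$ and $A$ can be sketched as follows (NumPy). The distance follows Eq. (13) as reconstructed above; the choice to exclude each component's zero self-distance when selecting its $K'$ nearest neighbours, and the default weights $\omega_1 = \omega_2 = 0.5$, are our own assumptions for illustration.

```python
import numpy as np

def bhattacharyya(mu_i, cov_i, mu_j, cov_j):
    """Bhattacharyya-type distance between two Gaussian components, Eq. (13)."""
    cov = (cov_i + cov_j) / 2.0
    diff = mu_i - mu_j
    term1 = diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov_i) * np.linalg.det(cov_j)))
    return term1 + term2

def representative_matrix(mus, covs, K_prime, w1=0.5, w2=0.5):
    """Adjacency (Eq. (14)), attraction (Eqs. (15)-(17)) and representative
    (Eq. (18)) matrices over the n' mixture components."""
    n_prime = len(mus)
    dist = np.array([[bhattacharyya(mus[i], covs[i], mus[j], covs[j])
                      for j in range(n_prime)] for i in range(n_prime)])
    W = -dist                                          # Eq. (14)
    # zeta(c_i): average distance to the K' nearest other components.
    zeta = np.sort(dist, axis=1)[:, 1:K_prime + 1].mean(axis=1)
    M = -(zeta[:, None] + zeta[None, :]) / 2.0         # Eq. (15)
    return w1 * W + w2 * M                             # Eq. (18)
```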
In the fifth step, GMMSE constructs a graph $N = (C, A)$ whose vertices represent the data samples in the new dataset $C$, and whose edges denote the similarity between data samples. The normalized cut approach proposed by Shi and Malik [47] is applied to partition the graph into $K$ components. This is a recursive process based on binary partition. The normalized cut algorithm first partitions the vertex set $C$ into two subsets, e.g. $P$ and $Q$. The cost function $\Omega(P, Q)$, which represents the disassociation between $P$ and $Q$, is defined as:

$$\Omega(P, Q) = \frac{\phi(P, Q)}{\psi(P, C)} + \frac{\phi(P, Q)}{\psi(Q, C)} \qquad (19)$$

$$\phi(P, Q) = \sum_{c_i \in P,\, c_j \in Q} a_{ij} \qquad (20)$$

$$\psi(P, C) = \sum_{c_i \in P,\, c_l \in C} a_{il} \qquad (21)$$

where $a_{ij}$ denotes the weight of the edge between the vertices $c_i$ and $c_j$, which is also the corresponding entry value in the representative matrix $A$.

Although finding the normalized cut is an NP-complete problem, an approximate solution can be obtained by considering the discrete variables as continuous variables. Based on this constraint relaxation, an associated cost measure $\Omega(P, Q)$ can be defined as follows:

$$\Omega(P, Q) = \frac{\sum_{(x_i > 0,\, x_j < 0)} -a_{ij} x_i x_j}{\sum_{x_i > 0} h_i} + \frac{\sum_{(x_i < 0,\, x_j > 0)} -a_{ij} x_i x_j}{\sum_{x_i < 0} h_i} \qquad (22)$$

where $x = [x_1, \ldots, x_{n'}]^T$ is an $n'$-dimensional real-valued vector ($n'$ is the number of data samples in the new dataset $C$), which can be interpreted as the continuous version of an indicator vector for the partition, and $h_i = \sum_j a_{ij}$ is the sum of the weights from the vertex $c_i$ to all other vertices. In this way, the normalized cut problem can be recast into a continuous optimization problem, in which the normalized cut value $\Omega(x)$ is minimized as follows:

$$\min_x \Omega(x) = \min_b \frac{b^T (R - A) b}{b^T R b} \qquad (23)$$

$$b = (1 + x) - d(1 - x) \qquad (24)$$

$$d = \frac{\sum_{x_i > 0} h_i}{\sum_{x_i < 0} h_i} \qquad (25)$$

with the constraints

$$b_i \in \{1, -d\}, \qquad b^T R \mathbf{1} = 0 \qquad (26)$$

where $R$ is an $n' \times n'$ diagonal matrix with $h_i$ ($i \in \{1, \ldots, n'\}$) on its diagonal, $A$ is an $n' \times n'$ symmetric matrix with entries $a_{ij}$, and $\mathbf{1}$ denotes the all-ones vector.


This optimization problem is solved by the following generalized eigenvalue system:

$$(R - A)\, b = \lambda R b \qquad (27)$$

where $\lambda$ denotes an eigenvalue. The solution of the relaxed normalized cut problem is given by the eigenvector corresponding to the second smallest eigenvalue of the above system. Based on the partition results, the normalized cut algorithm checks the stability of the cut as defined in [47], and ensures that the value of the objective function is below a pre-specified threshold. If the condition is not satisfied, it will continue to subdivide the current partition results. The time complexity of the normalized cut algorithm is $O(n'^3)$, while the space complexity is $O(n'^2)$.
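A sketch of a single binary split under the relaxation in Eqs. (23)-(27) is given below (NumPy/SciPy), using the generalized symmetric eigensolver. It assumes a symmetric, non-negative affinity matrix (the toy affinity here is a Gaussian kernel; in GMMSE the representative matrix $A$ would play this role, possibly after rescaling to non-negative weights). The recursive subdivision and the stability check of [47] are omitted.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(A):
    """One binary normalized-cut split of the graph with affinity matrix A,
    via the generalized eigensystem (R - A) b = lambda * R * b, Eq. (27)."""
    h = A.sum(axis=1)                  # h_i: total edge weight incident to vertex i
    R = np.diag(h)
    # Eigenvalues come back in ascending order, so column 1 holds the
    # eigenvector of the second smallest eigenvalue.
    _, vecs = eigh(R - A, R)
    indicator = vecs[:, 1]
    # Threshold the continuous indicator at zero to recover the two subsets.
    return indicator >= 0

# Toy usage: ten "components" forming two well separated groups.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.3, (5, 2)), rng.normal(3.0, 0.3, (5, 2))])
gaps = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
affinity = np.exp(-gaps)               # symmetric non-negative affinity
print(ncut_bipartition(affinity))      # first five entries differ from the last five
```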
Based on the partition result from the normalized cut algorithm, GMMSE partitions the new dataset $C = \{c_1, c_2, \ldots, c_{n'}\}$ into $K$ groups $G_1, G_2, \ldots, G_K$, and learns the new cluster structure $H$. In the sixth step, GMMSE specifies four assignment criteria to distribute the data samples $f_1, f_2, \ldots, f_n$ to their corresponding groups based on the new cluster structure $H$. They are the nearest Gaussian model criterion (NGM), the average Gaussian model criterion (AGM), the nearest group center criterion (NGC) and the Gaussian model based majority voting criterion (GMV).

The nearest Gaussian model criterion (NGM) first finds the Gaussian model in the new dataset $C$ to which the data sample $f_i$ has the highest probability of belonging:

$$h^* = \arg\max_h p(f_i \mid c_h) \qquad (28)$$

where $h \in \{1, \ldots, n'\}$, and then adopts the mapping function $\Lambda$ to convert the label of $c_{h^*}$ to the label of $f_i$. This process is defined as follows:

$$l_i = \Lambda(c_{h^*}) \qquad (29)$$

where $l_i$ is the label of $f_i$.
The average Gaussian model criterion (AGM) first calculates the average probability of the data sample $f_i$ belonging to a Gaussian model in the group $G_j$:

$$C(f_i, G_j) = \frac{\sum_{c \in G_j} p(f_i \mid c)}{|G_j|} \qquad (30)$$

Then, the label of $f_i$ is determined as follows:

$$l_i = \arg\max_j C(f_i, G_j) \qquad (31)$$

The nearest group center criterion (NGC) first calculates the centers of the groups as follows:

$$g_j = \frac{\sum_{\mu \in G_j} \mu}{|G_j|} \qquad (32)$$

where $j \in \{1, \ldots, K\}$, $|G_j|$ is the cardinality of the group $G_j$, and $\mu$ is the mean vector of the Gaussian model $c$ obtained by GMMSE. Then, the data point $f_i$ in the original dataset $F$ is assigned to the corresponding group $G_{j^*}$ by the following criterion:

$$l_i = \arg\min_j w(f_i, g_j) \qquad (33)$$

where $w$ is the distance metric.


The Gaussian model based majority voting criterion (GMV) first identifies the model in each Gaussian mixture $H_b$ to which the data sample $f_i$ has the highest probability of belonging:

$$c^*_b = \arg\max_{c \in H_b} p(f_i \mid c) \qquad (34)$$

where $b \in \{1, \ldots, B\}$. Then, the mapping function $\Lambda$ is adopted to map the label of $c^*_b$ to the label of $f_i$, which is defined as follows:

$$l_i^b = \Lambda(c^*_b) \qquad (35)$$

where $l_i^b$ is the label of $f_i$ with respect to the Gaussian mixture model $H_b$.

$$l_i = \arg\max_{l \in \{1, 2, \ldots, K\}} \sum_{b=1}^{B} \mathbb{1}\{l_i^b = l\} \qquad (36)$$

$l_i$ is the label of $f_i$ in the final result.
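As an illustration of the GMV criterion, the sketch below votes across the $B$ mixtures (NumPy/SciPy). The mapping function $\Lambda$ is represented here by a lookup table `component_group`, where `component_group[b][j]` holds the group in $\{0, \ldots, K-1\}$ that the normalized-cut step assigned to the $j$-th component of the $b$-th mixture; this data layout is our own assumption, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmv_assign(F, models, component_group, K):
    """Gaussian model based majority voting (GMV), Eqs. (34)-(36).

    `models` is a list of B fitted mixtures, each a (pis, mus, covs) triple."""
    F = np.asarray(F, dtype=float)
    n = F.shape[0]
    votes = np.zeros((n, K), dtype=int)
    for b, (pis, mus, covs) in enumerate(models):
        # Eq. (34): the component of mixture b that best explains each sample.
        dens = np.column_stack([multivariate_normal.pdf(F, mean=mus[j], cov=covs[j])
                                for j in range(len(pis))])
        best = dens.argmax(axis=1)
        # Eq. (35): map each winning component to its group label via Lambda.
        labels_b = np.asarray(component_group[b])[best]
        votes[np.arange(n), labels_b] += 1
    # Eq. (36): majority vote over the B mixtures.
    return votes.argmax(axis=1)
```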


The time complexity of NGM, AGM and GMV is $O(n' n)$, while that of NGC is $O(K n)$. Since $n' \ll n$ and $K \ll n$, the overall time complexity of the assignment criteria is $O(n)$. The space complexity is also $O(n)$. The total computational cost of GMMSE consists of six components:

$$v_{\mathrm{GMMSE}} = v_1 + v_2 + v_3 + v_4 + v_5 + v_6 \qquad (37)$$

where $v_1$ denotes the computational cost of the bagging technique in the first step, $v_2$ is the cost of constructing the Gaussian mixture models in the second step, and $v_3$ is the cost of estimating the parameter values of the mixture models by the Expectation Maximization approach in the third step. $v_4$ denotes the computational cost of generating the representative matrix in the fourth step, $v_5$ is the cost of the normalized cut algorithm in the fifth step, and $v_6$ is the cost of assigning the input data samples to the corresponding clusters based on the proposed criteria in the sixth step. The cost of each step depends on the number of input data samples $n$ in the original dataset, the number of attributes $m$ and the number of components in the Gaussian mixture models $n'$ as follows: $v_1 = O(n)$, $v_2 = O(n)$, $v_3 = O(m^2 n)$, $v_4 = O(n'^2 \log n')$, $v_5 = O(n'^3)$ and $v_6 = O(n)$. As a result, the time complexity of GMMSE is $O(m^2 n + n'^3)$. In addition, the space complexity of GMMSE is $O(m^2 n + n'^2)$.

4. Experiment

We conduct a number of experiments on synthetic datasets in Table 2 and real datasets from the UCI machine learning
repository in Table 3 to evaluate the performance of GMMSE.
Table 2 shows the details of the synthetic datasets (where $k$ is the number of clusters, $n$ is the number of data samples, $m$ is the number of attributes, and $\ell$ is the number of noisy attributes), which are generated by different Gaussian distributions with randomly selected centers, and the covariance matrices are of the form $\sigma I$ ($I$ is the identity matrix). All the synthetic datasets contain noisy attributes.
Table 3 provides a summary of the real datasets from the UCI machine learning repository that we use, which includes the
Iris dataset, the Breast Cancer Wisconsin (Original) dataset (WOBC), the Breast Cancer Wisconsin (Diagnostic) dataset
(WDBC), the Wine dataset, the Soybean (Small) dataset, the Statlog (Image Segmentation) dataset, the Pen-Based Recogni-
tion of Handwritten Digits testing dataset (Pentes) and the Optical Recognition of Handwritten Digits training dataset
(Opttra).

Table 2
The summary of the synthetic datasets (where $k$ is the number of clusters, $n$ is the number of data samples, $m$ is the number of attributes and $\ell$ is the number of noisy attributes).

Datasets     Source                   k    n     m    l    sigma
Synthetic1   Synthesized by authors   5    160   20   5    0.2
Synthetic2   Synthesized by authors   5    570   30   15   0.25
Synthetic3   Synthesized by authors   5    980   50   20   0.25
Synthetic4   Synthesized by authors   10   1400  75   30   0.2

Table 3
The summary of the real datasets from the UCI machine learning repository.

Datasets          Source   k    n     m
Iris              [2]      3    150   4
WOBC              [2]      2    583   9
WDBC              [2]      2    569   30
Wine              [2]      3    178   13
Soybean (small)   [2]      3    150   19
Segmentation      [2]      7    2310  4
Pentes            [2]      10   3498  16
Opttra            [2]      10   3823  64

We adopt the Rand index [29] and the normalized mutual information [13] to evaluate the quality of clustering. Given the true class partition $P$ with $K$ classes, $P = \{G_1, G_2, \ldots, G_K\}$, and the predicted partition $P^p$ with $K^p$ clusters, $P^p = \{G^p_1, G^p_2, \ldots, G^p_{K^p}\}$, the Rand index (RI) is defined as follows:

$$RI = \frac{e_1 + e_4}{e_1 + e_2 + e_3 + e_4} \qquad (38)$$

where $e_1$ denotes the number of sample pairs $f_i$ and $f_j$ assigned to the same class (cluster) in both the true class partition $P$ and the predicted partition $P^p$. $e_2$ denotes the number of sample pairs which belong to different classes (clusters) in $P$ but are assigned to the same class (cluster) in $P^p$. $e_3$ denotes the number of sample pairs which belong to the same class (cluster) in $P$ but are assigned to different classes (clusters) in $P^p$. $e_4$ denotes the number of pairs which belong to different classes (clusters) in $P$ as well as in $P^p$.

The normalized mutual information (NMI) [13] is defined as follows:

$$NMI(P, P^p) = \frac{2\, C(P, P^p)}{D(P) + D(P^p)} \qquad (39)$$

$$C(P, P^p) = \sum_k \sum_j \frac{|G_k \cap G^p_j|}{n} \log \frac{n\, |G_k \cap G^p_j|}{|G_k|\, |G^p_j|} \qquad (40)$$

$$D(P) = -\sum_k \frac{|G_k|}{n} \log \frac{|G_k|}{n} \qquad (41)$$

$$D(P^p) = -\sum_j \frac{|G^p_j|}{n} \log \frac{|G^p_j|}{n} \qquad (42)$$

where $k \in \{1, \ldots, K\}$, $j \in \{1, \ldots, K^p\}$, $n$ is the total number of data samples, and $|\cdot|$ denotes the cardinality of a cluster.
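For reference, the two measures can be computed directly from Eqs. (38)-(42) as in the unoptimized sketch below (pure Python/NumPy); scikit-learn's metrics module provides comparable, faster implementations.

```python
import numpy as np
from itertools import combinations

def rand_index(true_labels, pred_labels):
    """Rand index, Eq. (38): fraction of sample pairs on which the two
    partitions agree (same-same or different-different)."""
    agree = sum((t1 == t2) == (p1 == p2)
                for (t1, p1), (t2, p2) in combinations(zip(true_labels, pred_labels), 2))
    n_pairs = len(true_labels) * (len(true_labels) - 1) // 2
    return agree / n_pairs

def nmi(true_labels, pred_labels):
    """Normalized mutual information, Eqs. (39)-(42)."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    n = len(true_labels)
    mi, h_true, h_pred = 0.0, 0.0, 0.0
    for k in np.unique(true_labels):
        pk = (true_labels == k).mean()
        h_true -= pk * np.log(pk)                      # Eq. (41)
        for j in np.unique(pred_labels):
            nkj = np.sum((true_labels == k) & (pred_labels == j))
            if nkj > 0:                                # Eq. (40)
                mi += (nkj / n) * np.log(
                    n * nkj / ((true_labels == k).sum() * (pred_labels == j).sum()))
    for j in np.unique(pred_labels):
        pj = (pred_labels == j).mean()
        h_pred -= pj * np.log(pj)                      # Eq. (42)
    return 2 * mi / (h_true + h_pred)                  # Eq. (39)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, relabelled
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))         # 1.0
```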
The performances of GMMSE and other ensemble approaches are measured by the average value and the corresponding
standard deviation of RI and NMI respectively after performing 20 runs. The final number of clusters in GMMSE and its com-
petitors, which include [48,19,31], is set to the number of clusters in the ground truth partition. The default sampling rate of
the bagging technique in GMMSE is set to 0.8.
In the following experiments, we first investigate the effect of the four newly proposed assignment criteria. Then, the
effect of the bagging technique will be explored. Next, we study the effect of two distance functions (the Bhattacharyya
distance function and the Mahalanobis distance function) which calculate the similarity between two Gaussian models.
Finally, we will compare GMMSE with several traditional single clustering algorithms and cluster ensemble approaches.

4.1. The effect of the four newly proposed assignment criteria

In order to evaluate the effect of the four newly proposed assignment criteria, which are the nearest Gaussian model cri-
terion (NGM), the average Gaussian model criterion (AGM), the nearest group center criterion (NGC) and the Gaussian model
based majority voting criterion (GMV), we apply them to the Iris dataset, the WDBC dataset, the Pentes dataset and the Opt-
tra dataset, and measure the resulting performance with respect to the average values and the corresponding standard devi-
ations of NMI and RI respectively.
Figs. 2 and 3 show the results obtained by GMMSE based on the different assignment criteria (NGM, AGM, NGC and GMV)
respectively. There are several interesting observations. First, GMMSE based on GMV achieves the best performance with
respect to NMI on the Iris dataset, the WDBC dataset and the Pentes dataset, and obtains the best results in terms of RI
on the Iris dataset and the WDBC dataset. The possible reason is that the GMV criterion not only takes into account the asso-
ciation between the samples of the newly generated dataset by the bagging technique and the components of a Gaussian
mixture model, but also adopts the voting scheme to obtain majority consensus on the data sample labels. Second, GMMSE
based on NGC works well on the Opttra dataset in the case of NMI, and obtains satisfactory results on the Pentes dataset and
the Opttra dataset with respect to RI. However, the performance of GMMSE based on NGC in terms of NMI and RI is not sat-
isfactory on the Iris dataset and the WDBC dataset. The possible reasons are as follows: (1) when the number of clusters is
small, the average centers selected by the NGC assignment from multiple Gaussian models cannot capture the complete
structure of the dataset, which leads to incorrect assignment of data samples. (2) When the number of clusters increases,
the number of centers will also increase. These centers will capture more information of the dataset, which in turn helps
NGC to achieve good performance. Third, the results obtained by GMMSE based on NGM and AGM are not as good as the
corresponding results based on GMV on all the datasets. As a result, GMMSE based on GMV is a better choice for most of
the datasets, especially when the number of clusters is small. GMMSE based on NGC is a suitable choice when the number
of clusters is large.

4.2. The effect of the bagging technique

The bagging technique is applied to generate diverse datasets in the first step of GMMSE. Accuracy and diversity are two
important criteria for the cluster ensemble approaches.


Fig. 2. The performance of GMMSE based on NGM, AGM, NGC and GMV with respect to NMI on the Iris dataset, the WDBC dataset, the Pentes dataset and
the Opttra dataset respectively.


Fig. 3. The performance of GMMSE based on NGM, AGM, NGC and GMV in terms of RI on the Iris dataset, the WDBC dataset, the Pentes dataset and the
Opttra dataset respectively.

The objective of the bagging technique is to generate a set of datasets which are as diverse as possible, such that the accuracy of the final result can be improved. To investigate the effect of the
bagging technique, we vary the sampling rate from 0.5 to 1 with an increment of 0.1. The performance of GMMSE based on


Fig. 4. The performance of GMMSE based on GMV and NGC with different sampling rates in terms of NMI on the Iris dataset, the WDBC dataset, the Pentes
dataset and the Opttra dataset respectively.


Fig. 5. The performance of GMMSE based on GMV and NGC with different sampling rates in terms of RI on the Iris dataset, the WDBC dataset, the Pentes
dataset and the Opttra dataset respectively.


Fig. 6. The comparison results of GMMSE based on the Bhattacharyya distance and the Mahalanobis function with respect to NMI (where Bh_NGC and
Bh_GMV denote GMMSE based on GMV and NGC with the Bhattacharyya distance, and Ma_NGC and Ma_GMV denote GMMSE based on GMV and NGC with
the Mahalanobis distance).


Fig. 7. The comparison results of GMMSE based on the Bhattacharyya distance and the Mahalanobis distance with respect to RI.

The performance of GMMSE based on GMV and NGC with different sampling rates is evaluated on the Iris dataset, the WDBC dataset, the Pentes dataset and the
Opttra dataset with respect to NMI and RI respectively. We adopt the bagging approach without replacement.
Figs. 4 and 5 show the results obtained by GMMSE with different sampling rates. It is observed that when the sampling
rate is close to 0.8, GMMSE achieves better performance. When the sampling rate deviates from 0.8, the results obtained by
GMMSE become less satisfactory. The possible reasons are as follows: (1) when the sampling rate is smaller than 0.8, the
diversity among newly generated datasets increases, but these datasets are not able to capture the cluster structure of
the original dataset, which will degrade the final performance. (2) When the sampling rate is higher than 0.8, the diversity
among the newly generated datasets decreases, which will affect the accuracy of the final result. (3) When the sampling rate
is close to 0.8, the accuracy and the diversity attain a balance. This indicates that a suitable bagging rate is able to generate
moderate diversity which is helpful to improve the accuracy of the final result as concluded in [37,26].

4.3. The effect of different distance functions

The Bhattacharyya distance and the Mahalanobis distance are two widely used distance functions for measuring the sim-
ilarity between two distributions. The main reason for comparing the effects of these two distance measures is that the Maha-
lanobis distance is widely adopted in cluster analysis and classification techniques.


Fig. 8. The performance of GMMSE based on NGC and GMV, EMGMM, K-means and FCM (fuzzy c-means) with respect to NMI on both synthetic datasets in
Table 2 and real datasets from the UCI machine learning repository in Table 3 respectively.

It corresponds to a special case of the Bhattacharyya distance when the standard deviations of the two classes are the same. Compared with the Bhattacharyya
distance, the Mahalanobis distance can be more efficiently calculated. If the condition of using the Mahalanobis distance is
approximately satisfied, then significant computation time can be saved. Figs. 6 and 7 show the comparison results of GMMSE
based on GMV and NGC with the Bhattacharyya distance and the Mahalanobis distance with respect to NMI and RI on the Iris
dataset, the WDBC dataset, the Pentes dataset and the Opttra dataset respectively. Bh_NGC and Bh_GMV denote GMMSE
based on GMV and NGC with the Bhattacharyya distance, while Ma_NGC and Ma_GMV denote GMMSE based on GMV and
NGC with the Mahalanobis distance. As shown in the figures, Bh_GMV achieves the best performance on the Iris dataset,
the WDBC dataset and the Pentes dataset, and Bh_NGC obtains the best results on the Opttra dataset. The possible reason
could be that in many of the scenarios considered above the special conditions under which the Mahalanobis distance works
best are not strictly satisfied.

4.4. Comparison with single clustering algorithms

The performance of GMMSE is compared with several single clustering algorithms which include the EM algorithm based
on Gaussian mixture models (EMGMM), the K-means algorithm and the fuzzy c-means algorithm (FCM) with respect to NMI
and RI on synthetic datasets in Table 2, and real datasets from the UCI machine learning repository in Table 3 respectively.


Fig. 9. The performance of GMMSE based on NGC and GMV, EMGMM, K-means and FCM (fuzzy c-means) with respect to RI on both synthetic datasets in
Table 2 and real datasets from the UCI machine learning repository in Table 3 respectively.

Figs. 8 and 9 show the results obtained by GMMSE based on NGC and GMV, EMGMM, K-means and FCM respectively.
GMMSE achieves comparable or better performance than EMGMM, KM and FCM on all the datasets. The possible reasons
are as follows: (1) GMMSE is capable of identifying the most representative cluster structure for different datasets. (2)
The new assignment criteria allow more accurate distribution of the data samples to their corresponding clusters. There
are several interesting observations from the figures. First, the average values of RI and NMI obtained by EMGMM are smaller
than those obtained by GMMSE, while the corresponding standard deviations of NMI and RI are much larger than those of
GMMSE. Although EMGMM is one of the major components in GMMSE, it is only suitable for datasets which can be ade-
quately approximated by Gaussian distributions. On the other hand, GMMSE combines multiple cluster structures into a uni-
fied structure, which makes it more robust and stable. Second, the performance of K-means is not as satisfactory as that of
GMMSE on all the datasets, especially on the synthetic datasets.

Table 4
The parameter settings for GMMSE based on NGC and GMV, KMSE, KMCE, EMCE and LCE based on CTS, SRS and ASRS (where $B$ is the number of cluster structures in the ensemble, $K'$ is the number of clusters in each basic clustering algorithm in the ensemble, which is randomly selected from $\{2, \ldots, 2k\}$, $k$ is the true number of clusters in the dataset, $K$ is the number of clusters in the final result, and $t$ is the number of runs).

Approaches                                   B    K'                  K   t
NGC, GMV, KMSE, KMCE, EMCE, CTS, SRS, ASRS   20   Randomly selected   k   20


Fig. 10. The performance of GMMSE based on NGC and GMV, KMSE, KMCE, EMCE and LCE based on CTS, SRS and ASRS in terms of NMI on both synthetic
datasets in Table 2 and real datasets from UCI machine learning repository in Table 3 respectively.

While K-means is an efficient clustering algorithm which is used to initialize the Gaussian mixture models in GMMSE, it is sensitive to noisy attributes, which will affect its performance.
Third, the results obtained by FCM are satisfactory, but not as good as those obtained by GMMSE. As a result, GMMSE is more
effective when compared with EMGMM, K-means and FCM.

4.5. Comparison with consensus cluster approaches

In the following experiments, GMMSE based on NGC and GMV are compared with conventional cluster ensemble
approaches in terms of NMI and RI on synthetic datasets in Table 2, and real datasets from the UCI machine learning repository
in Table 3 respectively. Conventional cluster ensemble approaches include the cluster structure ensemble based on K-means
(KMSE) in [65], the cluster ensemble approach based on the bagging technique, K-means and the normalized cut algorithm
(KMCE), the cluster ensemble approach based on the bagging technique, EM and the normalized cut algorithm (EMCE),
and the link-based cluster ensemble approach (LCE) in [32,31,30]. Since LCE based on average link achieves better results than
LCE based on single link and complete link, it is adopted in the experiments.


Fig. 11. The performance of GMMSE based on NGC, GMV, KMSE, KMCE, EMCE, and LCE based on CTS, SRS and ASRS in terms of NMI on both synthetic
datasets in Table 2 and real datasets from UCI machine learning repository in Table 3 respectively.

Table 5
The comparison of the run time (seconds) on the Iris dataset and the WDBC dataset.

Datasets   NGC     GMV     KMSE    KMCE    EMCE    CTS     SRS      ASRS
Iris       4.409   4.712   0.124   0.108   4.215   0.338   2.683    0.911
WDBC       4.864   8.628   0.246   0.533   6.288   2.232   24.302   7.867

Table 6
The comparison of the NMI value on the Iris dataset and the Wine dataset (where the NMI values of CTS, SRS and ASRS are directly obtained from [32]).

Datasets   NGC     GMV     CTS     SRS     ASRS
Iris       0.803   0.903   0.816   0.789   0.812
Wine       0.877   0.899   0.838   0.791   0.830

According to different similarity measures, the variants of LCE include LCE with connected-triple-based similarity (CTS), LCE with simRank-based similarity (SRS) and LCE
with approximate SimRank-based similarity (ASRS). Table 4 shows the parameter settings for GMMSE based on NGC and
GMV, KMSE, KMCE, EMCE and LCE based on CTS, SRS and ASRS.
Figs. 10 and 11 show the comparison results of GMMSE based on NGC, GMV, KMSE, KMCE, EMCE, and LCE based on CTS,
SRS and ASRS with respect to NMI. First, we can see that GMMSE based on GMV outperforms its competitors in six out of
twelve datasets, which are the Iris dataset, the WDBC dataset, the Pentes dataset, the Wine dataset, the Soybean-small data-
set and the image segmentation dataset. This indicates that GMMSE based on GMV assigns most of the data samples to their
corresponding clusters correctly. The main reasons are as follows: (1) GMMSE identifies the most representative cluster
structure from multiple structures which are obtained from different data sources. (2) It adopts the representative matrix
to capture the relationship among the components of the Gaussian mixture models, which will provide more information
for forming the ensemble. (3) GMMSE applies the Gaussian Model-based majority voting criterion (GMV) to assign data sam-
ples to the clusters as correctly as possible. Second, NGC, GMV, CTS, SRS and ASRS achieve very good performance on the
synthetic datasets. This indicates that both GMMSE and LCE are robust to noisy datasets, which could be due to the following
reasons: (1) The focus of GMMSE on the overall cluster structure, instead of the individual data points, reduces its sensitivity
to noise. (2) LCE applies the average link approach to de-emphasize the effect of noisy samples.
We also compare the run time of different consensus clustering approaches on the Iris dataset and the WDBC dataset as
shown in Table 5. There exists a tradeoff between the run time and the accuracy. For example, although KMSE and KMCE are
the most efficient algorithms, the results obtained by these two algorithms with respect to NMI are not very satisfactory,
especially on the Iris dataset and the Segmentation dataset as illustrated in Figs. 10 and 11. GMMSE based on NGC and
GMV require a longer run time, but they obtain better results with respect to the accuracy, which is more important for
the user in many scenarios. As a result, GMMSE is a good choice for both synthetic datasets and real datasets by taking into
account the importance of both the run time and the accuracy.
We also compare GMMSE based on NGC and GMV with LCE based on CTS, SRS and ASRS with respect to the average NMI on the Iris dataset and the Wine dataset. The results obtained by LCE based on CTS, SRS and ASRS are taken directly from [32], as shown in Table 6. It is observed that GMMSE based on GMV outperforms its competitors and achieves the best performance on both datasets. The average NMI values of 0.903 and 0.899 obtained by GMMSE based on GMV on the Iris dataset and the Wine dataset are 0.087 and 0.061 larger than those obtained by the best LCE variant, which are 0.816 and 0.838 respectively.
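As a side note, NMI values such as those in Table 6 can be reproduced with any standard implementation of normalized mutual information; the snippet below is a generic illustration with made-up label vectors, not the actual Iris or Wine results.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Toy ground-truth labels and a toy consensus clustering (illustrative only).
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
consensus = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0])

# NMI lies in [0, 1]; a value of 1 means the two partitions agree up to a relabeling.
print(normalized_mutual_info_score(truth, consensus))
```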
5. Conclusion and future work
In this paper, we propose a new cluster structure ensemble framework, referred to as the Gaussian mixture model based cluster structure ensemble framework (GMMSE), for identifying the most representative cluster structure from data. Compared with previous cluster ensemble approaches, GMMSE integrates multiple structures from different data sources into a unified structure, and applies four new assignment criteria to distribute data samples to their corresponding clusters. We perform a thorough analysis of GMMSE theoretically and experimentally, and draw conclusions as follows: (1) The adoption of a suitable sampling rate and distance function will improve the performance of GMMSE. (2) The adoption of K-means to initialize Gaussian mixture models in GMMSE is able to improve its efficiency. (3) GMMSE is robust to noisy datasets. In addition, the results of the experiments indicate that GMMSE outperforms a number of state-of-the-art cluster ensemble approaches, such as LCE, on most of the datasets. In the future, we shall apply GMMSE to address different clustering problems in machine learning, pattern analysis and data mining.
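To illustrate conclusion (2), the sketch below shows one common way to seed a Gaussian mixture model with K-means centroids before running EM. It uses scikit-learn for brevity and is a generic illustration under our own assumptions, not the authors' code; the function name fit_gmm_with_kmeans_init is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def fit_gmm_with_kmeans_init(X, n_components, seed=0):
    # K-means supplies the initial component means, which usually reduces
    # the number of EM iterations needed before convergence.
    km = KMeans(n_clusters=n_components, n_init=10, random_state=seed).fit(X)
    gmm = GaussianMixture(n_components=n_components,
                          means_init=km.cluster_centers_,
                          random_state=seed)
    return gmm.fit(X)  # EM then refines the means, covariances and mixing weights
```

As a design note, scikit-learn's GaussianMixture already defaults to a K-means-style initialization, which is consistent with the efficiency observation above.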
Acknowledgments
The authors are grateful for the constructive advice on the revision of the manuscript from the anonymous reviewers. The
work described in this paper was partially funded by the grant from the Hong Kong Scholars Program (Project No.
XJ2012015) and the outstanding talent training plan of South China University of Technology, and supported by grants from
the National Natural Science Foundation of China (NSFC) (Project Nos. 61003174, 61070090, 61273363, 61379033), the
NSFC-Guangdong Joint Fund (Project No. U1035004), the New Century Excellent Talents in University (Project No. NCET-
11-0165), the Guangdong Natural Science Funds for Distinguished Young Scholar (Project No. S2013050014677), a grant
from Science and Technology Planning Project of Guangzhou (Project No. 11A11080267), a grant from China Postdoctoral
Science Foundation (Project No. 2013M540655), a grant from Foundation of Guangdong Educational Committee (Project
No. 2012KJCX0011), a grant from the Fundamental Research Funds for the Central Universities (Project No. 2014G0007),
a grant from Key Enterprises and Innovation Organizations in Nanshan District in Shenzhen (Project No. KC2013ZDZJ0007A),
a grant from Natural Science Foundation of Guangdong Province, China (Project No. S2012010009961), a grant from the
Doctoral Program of Higher Education (Project No. 20110172120027), a grant from the Cooperation Project in Industry,
Education and Academy of Guangdong Province and Ministry of Education of China (Project No. 2011B090400032), a grant
from the City University of Hong Kong (Project No. 7004047) and the grants from the Hong Kong Polytechnic University
(G-YK77 and G-YK53).
References
[1] M.F. Amasyali, O. Ersoy, The performance factors of clustering ensembles, in: IEEE 16th Signal Processing, Communication and Applications Conference
(SIU 2008), 2008, pp. 1–4.
[2] A. Asuncion, D.J. Newman, UCI Machine Learning Repository, Irvine, CA: University of California, School of Information and Computer Science. <http://www.ics.uci.edu/mlearn/MLRepository.html>.
[3] H.G. Ayad, M.S. Kamel, Cumulative voting consensus method for partitions with variable number of clusters, IEEE Trans. Pattern Anal. Mach. Intell. 30
(1) (2008) 161–173.
[4] H.G. Ayad, M.S. Kamel, On voting-based consensus of cluster ensembles, Pattern Recognit. 43 (5) (2010) 1943–1953.
[5] J. Azimi, X. Fern, Adaptive cluster ensemble selection, in: Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009,
pp. 992–997.
[6] N. Bassiou, V. Moschou, C. Kotropoulos, Speaker diarization exploiting the eigengap criterion and cluster ensembles, IEEE Trans. Audio Speech Lang.
Process. 18 (8) (2010) 2134–2144.
[7] A. Bhattacharjee, W.G. Richards, J. Staunton, et al., Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma sub-classes, Proc. Natl. Acad. Sci. 98 (24) (2001) 13790–13795.
[8] A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc.
35 (1943) 99–109.
[9] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140.
[10] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[11] I.T. Christou, Coordination of cluster ensembles via exact methods, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2) (2011) 279–293.
[12] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, third ed., MIT Press and McGraw-Hill, 2009. ISBN 0-262-03384-4.
[13] T.M. Cover, J.A. Thomas, Elements of Information Theory, second ed., John Wiley & Sons, 2006.
[14] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc. 39 (1) (1977) 1–38.
[15] C. Domeniconi, M. Al-Razgan, Weighted cluster ensembles: methods and analysis, ACM Trans. Knowl. Discovery Data (TKDD) 2 (4) (2009) 1–42.
[16] L. Dyrskjot, K. Lindblad-Toh, D.M. Tanenbaum, M.J. Daly, E. Winchester, W.O. Lui, A. Villapakkam, S.E. Stanton, C. Larsson, T.J. Hudson, B.E. Johnson,
et al, Loss-of-heterozygosity analysis of small-cell lung carcinomas using single-nucleotide polymorphism arrays, Nat. Biotechnol. 18 (9) (2000) 1001–
1005.
[17] X.Z. Fern, C.E. Brodley, Random projection for high dimensional data clustering: a cluster ensemble approach, in: Proc. 20th Int’l Conf. Machine
Learning, 2003, pp. 186–193.
[18] X.Z. Fern, W. Lin, Cluster ensemble selection, Stat. Anal. Data Min. 1 (3) (2008) 128–141.
[19] A.L.N. Fred, A.K. Jain, Combining multiple clusterings using evidence accumulation, IEEE Trans. Pattern Anal. Mach. Intell. 27 (6) (2005) 835–850.
[20] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–
139.
[21] K. Ganchev, J. Graca, J. Blitzer, B. Taskar, Multi-view learning over clustered and non-identical outputs, in: Proc. 2008 Conf. Uncertainty in Artificial
Intelligence (UAI’08), 2008, pp. 204–211.
[22] J. Gao, F. Liang, W. Fan, Y. Sun, J. Han, Graph-based consensus maximization among multiple supervised and unsupervised models, Adv. Neural Inform.
Process. Syst. 22 (2009).
[23] N. García-Pedrajas, J. Maudes-Raedo, C. García-Osorio, J. Rodríguez-Díez, Supervised subspace projections for constructing ensembles of classifiers, Inform. Sci. 193 (2012) 1–21.
[24] R. Ghaemi, M. Sulaiman, H. Ibrahim, N. Mustapha, A survey: clustering ensembles techniques, World Acad. Sci. Eng. Technol. 50 (2009).
[25] D. Greene, G. Cagney, N. Krogan, P. Cunningham, Ensemble non-negative matrix factorization methods for clustering protein–protein interactions, Bioinformatics 24 (15) (2008) 1722–1728.
[26] S.T. Hadjitodorov, L.I. Kuncheva, L.P. Todorova, Moderate diversity for better cluster ensembles, Inform. Fusion 7 (3) (2006) 264–275.
[27] T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (8) (1998) 832–844.
[28] P. Hore, L.O. Hall, D.B. Goldgof, A scalable framework for cluster ensembles, Pattern Recognit. 42 (5) (2009) 676–688.
[29] X. Hu, E.K. Park, X. Zhang, Microarray gene cluster identification and annotation through cluster ensemble and EM-based informative textual
summarization, IEEE Trans. Inform. Technol. Biomed. 13 (5) (2009) 832–840.
[30] N. Iam-On, S. Garrett, LinkCluE: a MATLAB package for link-based cluster ensembles, J. Stat. Softw. 36 (9) (2010).
[31] N. Iam-on, T. Boongoen, S. Garrett, LCE: a link-based cluster ensemble method for improved gene expression data analysis, Bioinformatics 26 (12)
(2010) 1513–1519.
[32] N. Iam-On, T. Boongoen, S. Garrett, C. Price, A link-based approach to the cluster ensemble problem, IEEE Trans. Pattern Anal. Mach. Intell. 33 (12)
(2011) 2396–2409.
[33] N. Iam-On, T. Boongoen, S. Garrett, C. Price, A link-based cluster ensemble approach for categorical data clustering, IEEE Trans. Knowl. Data Eng. 24 (3)
(2012) 413–425.
[34] L.I. Kuncheva, D. Vetrov, Evaluation of stability of k-means cluster ensembles with respect to random initialization, IEEE Trans. Pattern Anal. Mach. Intell. 28 (11) (2006) 1798–1808.
[35] L.I. Kuncheva, S.T. Hadjitodorov, Using Diversity in Cluster Ensembles, SMC 2004, 2004, pp. 1214–1219.
[36] L.I. Kuncheva, J.J. Rodriguez, Classifier ensembles with a random linear oracle, IEEE Trans. Knowl. Data Eng. 19 (4) (2007) 500–508.
[37] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Mach. Learn. 51 (2) (2003)
181–207.
[38] T. Lange, J.M. Buhmann, Combining Partitions by Probabilistic Label Aggregation, SIGKDD 2005, 2005, pp. 147–156.
[39] S. Mimaroglu, E. Aksehirli, DICLENS: divisive clustering ensemble with automatic cluster number, IEEE/ACM Trans. Comput. Biol. Bioinform. 9 (2)
(2012).
[40] A. Mirzaei, M. Rahmati, A novel hierarchical-clustering-combination scheme based on fuzzy-similarity relations, IEEE Trans. Fuzzy Syst. 18 (1) (2010)
27–39.
[41] S. Monti, P. Tamayo, J. Mesirov, T. Golub, Consensus clustering: a resampling based method for class discovery and visualization of gene expression microarray data, Mach. Learn. 52 (1–2) (2003) 91–118.
[42] W. Pedrycz, Collaborative and Knowledge-Based Fuzzy Clustering, John Wiley, New York, 2006.
[43] W. Pedrycz, Granular Computing: Analysis and Design of Intelligent Systems, CRC Press/Francis Taylor, Boca Raton, 2013.
[44] J.J. Rodriguez, L.I. Kuncheva, C.J. Alonso, Rotation forest: a new classifier ensemble method, IEEE Trans. Pattern Anal. Mach. Intell. 28 (10) (2006) 1619–
1630.
[45] G. Seni, J. Elder, From trees to forests and rule sets – a unified overview of ensemble methods, in: Tutorial on KDD’07, San Jose, CA, 2007.
[46] X. Sevillano, F. Alías, J.C. Socoró, BordaConsensus: A New Consensus Function for Soft Cluster Ensembles, SIGIR 2007, 2007, pp. 743–744.
[47] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 888–905.
[48] A. Strehl, J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J. Mach. Learn. Res. 3 (2002) 583–617.
[49] A.P. Topchy, A.K. Jain, W. Punch, Combining multiple weak clusterings, in: Proc. IEEE Int’l Conf. Data Mining, 2003, pp. 331–338.
[50] A.P. Topchy, A.K. Jain, W. Punch, A mixture model for cluster ensembles, in: Proceedings of SIAM Conference on Data Mining, 2004, pp. 379–390.
[51] A.P. Topchy, A.K. Jain, W.F. Punch, Cluster ensembles: models of consensus and weak partitions, IEEE Trans. Pattern Anal. Mach. Intell. 27 (12) (2005)
1866–1881.
[52] A.P. Topchy, M.H.C. Law, A.K. Jain, A.L.N. Fred, Analysis of Consensus Partition in Cluster Ensemble, ICDM 2004, 2004, pp. 225–232.
[53] T. Wang, CA-Tree: a hierarchical cluster for efficient and scalable coassociation-based cluster ensembles, IEEE Trans. Syst. Man Cybernet. Part B:
Cybernet. 41 (3) (2011) 686–698.
[54] H. Wang, H. Shan, A. Banerjee, Bayesian cluster ensembles, Stat. Anal. Data Min. (2011) 54–70.
[55] M. Weber, W. Rungsarityotin, A. Schliep, Optimal clustering in the context of overlapping cluster analysis, Inform. Sci. 223 (2013) 56–74.
[56] J. Xiao, C. He, X. Jiang, D. Liu, A dynamic classifier ensemble selection approach for noise data, Inform. Sci. 180 (2010) 3402–3421.
[57] Y. Yang, K. Chen, Temporal data clustering via weighted clustering ensemble with different representations, IEEE Trans. Knowl. Data Eng. 23 (2) (2011)
307–320.
[58] Y. Ye, T. Li, et al., Automatic Malware Categorization using Cluster Ensemble, SIGKDD 2010, 2010, pp. 95–104.
[59] Z. Yu, Z. Deng, H.-S. Wong, L. Tan, Identifying Protein kinase-specific phosphorylation sites based on the Bagging–Adaboost ensemble approach, IEEE
Trans. NanoBioSci. 9 (2) (2010) 132–143.
[60] Z. Yu, L. Li, J. You, G. Han, SC3: triple spectral clustering based consensus clustering framework for class discovery from cancer gene expression profiles,
IEEE/ACM Trans. Comput. Biol. Bioinform. 9 (6) (2012) 1751–1765.
[61] Z. Yu, H.-S. Wong, H. Wang, Graph-based consensus clustering for class discovery from gene expression data, Bioinformatics 23 (21) (2007) 2888–
2896.
[62] Z. Yu, H. Chen, J. You, G. Han, L. Li, Hybrid fuzzy cluster ensemble framework for tumor clustering from bio-molecular data, IEEE/ACM Trans. Comput.
Biol. Bioinform. (2013), http://dx.doi.org/10.1109/TCBB.2013.59.
[63] Z. Yu, H.-S. Wong, Class discovery from gene expression data based on perturbation and cluster ensemble, IEEE Trans. NanoBioSci. 8 (2) (2009) 147–
160.
[64] Z. Yu, H.-S. Wong, J. You, Q. Yang, H. Liao, Knowledge based cluster ensemble for cancer discovery from bio-molecular data, IEEE Trans. NanoBioSci. 10
(2) (2011) 76–85.
[65] Z. Yu, J. You, H.-S. Wong, G. Han, From cluster ensemble to structure ensemble, Inform. Sci. 198 (2012) 81–99.
[66] X. Zhang, L. Jiao, F. Liu, L. Bo, M. Gong, Spectral clustering ensemble applied to SAR image segmentation, IEEE Trans. Geosci. Remote Sens. 46 (7) (2008)
2126–2136.
[67] S. Zhang, H.-S. Wong, ARImp: a generalized adjusted rand index for cluster ensembles, in: 20th International Conference on Pattern Recognition (ICPR),
2010, pp. 778–781.
[68] P. Zhang, X. Zhu, J. Tan, L. Guo, Classifier and cluster ensembles for mining concept drifting data streams, in: 2010 IEEE 10th International Conference
on Data Mining (ICDM), 2010, pp. 1175–1180.
[69] L. Zheng, T. Li, C. Ding, Hierarchical ensemble clustering, in: 2010 IEEE 10th International Conference on Data Mining (ICDM), 2010, pp. 1199–1204.
[70] Z.-H. Zhou, Y. Jiang, NeC4.5: Neural ensemble based C4.5, IEEE Trans. Knowl. Data Eng. 16 (6) (2004) 770–773.
[71] Z.-H. Zhou, J. Wu, W. Tang, Ensembling neural networks: many could be better than all, Artif. Intell. 137 (1–2) (2002) 239–263.