Abstract—This paper is concerned with developing a distributed k-means algorithm and a distributed fuzzy c-means
algorithm for wireless sensor networks (WSNs) where each node
is equipped with sensors. The underlying topology of the WSN
is supposed to be strongly connected. The consensus algorithm
in multiagent consensus theory is utilized to exchange the measurement information of the sensors in WSN. To obtain a faster
convergence speed as well as a higher possibility of having the
global optimum, a distributed k-means++ algorithm is first proposed to find the initial centroids before executing the distributed
k-means algorithm and the distributed fuzzy c-means algorithm. The proposed distributed k-means algorithm is capable
of partitioning the data observed by the nodes into measure-dependent groups which have small in-group and large out-group
distances, while the proposed distributed fuzzy c-means algorithm is capable of partitioning the data observed by the nodes
into different measure-dependent groups with degrees of membership values ranging from 0 to 1. Simulation results show
that the proposed distributed algorithms can achieve almost
the same results as those given by the centralized clustering
algorithms.
Index Terms—Consensus, distributed algorithm, fuzzy
c-means, hard clustering, k-means, soft clustering, wireless
sensor network (WSN).
I. INTRODUCTION
A WIRELESS sensor network (WSN) consists of spatially
distributed autonomous sensors with limited computing power, storage space, and communication range that
monitor physical or environmental conditions, such as temperature, sound and pressure, and to cooperatively pass on
Manuscript received June 24, 2015; revised October 25, 2015 and January
24, 2016; accepted February 2, 2016. This work was supported in part by
the National Natural Science Foundation of China under Grant 61333012 and
Grant 61473269, in part by the Fundamental Research Funds for the Central
Universities under Grant WK2100100023, in part by the Anhui Provincial
Natural Science Foundation under Grant 1408085QF105, and in part by the
Australian Research Council under Grant DP120104986. This paper was
recommended by Associate Editor R. Selmic.
J. Qin and W. Fu are with the Department of Automation, University
of Science and Technology of China, Hefei 230027, China (e-mail:
jhqin@ustc.edu.cn; fwm1993@mail.ustc.edu.cn).
H. Gao is with the Research Institute of Intelligent Control and
Systems, Harbin Institute of Technology, Harbin 150001, China (e-mail:
hjgao@hit.edu.cn).
W. X. Zheng is with the School of Computing, Engineering and
Mathematics, Western Sydney University, Sydney, NSW 2751, Australia
(e-mail: w.zheng@westernsydney.edu.au).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCYB.2016.2526683
their data through the network to a main location. The development of WSNs was motivated by military applications such
as battlefield surveillance; today such networks are used in
many industrial applications, for instance, industrial and manufacturing automation, machine health monitoring, and so
on [1]-[6].
When using a WSN as an exploratory infrastructure, it is
often desired to infer hidden structures in distributed data collected by the sensors. For a scene segmentation and monitoring
application, nodes in the WSN need to collaborate with each
other in order to segment the scene, identify the object of
interest, and thereafter to classify it. Outlier detection is also a
typical task for WSN with applications in monitoring chemical
spillage and intrusion detection [14].
Many centralized clustering algorithms have been proposed
to solve the data clustering problem in order to grasp the fundamental aspects of the ongoing situation [7]-[11]. However,
in WSN, large amounts of data are dispersedly collected
in geographically distributed nodes over networks. Due to
the limited energy, communication, computation and storage
resources, centralizing the whole distributed data to one fusion
node to perform centralized clustering may be impractical.
Thus, there is a great demand for distributed data clustering algorithms in which the global clustering problem can be
solved at each individual node based on local data and limited
information exchanges among nodes. In addition, compared
with the centralized clustering, distributed clustering is more
flexible and robust to node and/or link failures.
Clustering can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster
and how to efficiently find a cluster. One of the most popular notions of the clusters is groups with small in-group
and large out-group differences [7]-[11]. Different measures
of difference (or similarity) may be used to place items into
clusters, including distance and intensity. Many algorithms [7]-[10] have been proposed to deal with the clustering
problem based on distance. Among them, the k-means algorithm is the most popular hard clustering algorithm, which
features simple and fast-convergent iterations [7].
Parallel and distributed implementations of the k-means
algorithm have arisen frequently because of its high efficiency in dealing with large data sets. The algorithm proposed
in [12] is the parallel version of the k-means algorithm in
multiprocessors of distributed memory, which has no communication constraints. In [13], observations are assigned to
2168-2267 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
QIN et al.: DISTRIBUTED k-MEANS ALGORITHM AND FUZZY c-MEANS ALGORITHM FOR SENSOR NETWORKS
$$x_i(t+1) = x_i(t) + \epsilon \sum_{j \in N_i} a_{ij}\,\bigl(x_j(t) - x_i(t)\bigr) \qquad (2)$$

where the parameter $\epsilon$ is assumed to satisfy $\epsilon \in \bigl(0, 1/\max_i \sum_{j \neq i} a_{ij}\bigr)$. Suppose that the number of iterations is $t_{\max}$; then the computational complexity for a node is $O(n\,t_{\max})$ since $|N_i| \leq n$.
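The iteration in (2) can be sketched in a few lines. The example below is a minimal numerical simulation, not part of the paper's algorithms; the path graph, unit weights, step size ε = 0.25, and iteration count are illustrative assumptions.

```python
import numpy as np

def average_consensus(x0, A, eps, t_max):
    """Run x_i(t+1) = x_i(t) + eps * sum_j a_ij (x_j(t) - x_i(t)).

    A is a symmetric nonnegative weight matrix with zero diagonal;
    for eps in (0, 1/max_i sum_j a_ij) every state converges to the
    average of the initial values (asymptotically, per Remark 2).
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(t_max):
        # Each node moves toward the weighted combination of its neighbors.
        x = x + eps * (A @ x - A.sum(axis=1) * x)
    return x

# Path graph on 4 nodes with unit weights; initial average is 1.0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = average_consensus([4.0, 0.0, 0.0, 0.0], A, eps=0.25, t_max=500)
# All four states approach the initial average 1.0.
```

Because A is symmetric, the sum of the states is preserved at every step, which is why the common limit must be the initial average.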
Remark 2: Note that the result of the average-consensus algorithm is only asymptotically correct. This differs from the max-consensus/min-consensus algorithms, which provide an exact result after a finite number of iterations. In the following, we will introduce the finite-time average-consensus algorithm, which obtains the exact average in a finite number of steps.
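The exactness contrast drawn in Remark 2 can be illustrated with a short sketch of max-consensus; the directed-ring topology and the values are illustrative assumptions, and the number of rounds equals the graph diameter.

```python
def max_consensus(values, neighbors, rounds):
    """Each node repeatedly replaces its value with the maximum over
    itself and its in-neighbors; after diam(G) rounds every node holds
    the global maximum exactly (unlike asymptotic average consensus)."""
    x = list(values)
    for _ in range(rounds):
        x = [max([x[i]] + [x[j] for j in neighbors[i]])
             for i in range(len(x))]
    return x

# Directed ring on 4 nodes: node i hears only from node (i-1) mod 4.
nbrs = {i: [(i - 1) % 4] for i in range(4)}
result = max_consensus([3.0, 9.0, 1.0, 5.0], nbrs, rounds=3)
print(result)  # [9.0, 9.0, 9.0, 9.0]
```

Min-consensus is the same iteration with `min` in place of `max`.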
Introduce $W = [w_{ij}] \in \mathbb{R}^{N \times N}$ with

$$w_{ij} = \begin{cases} a_{ij}, & i \neq j \\ 1 - \sum_{k=1}^{N} a_{ik}, & i = j. \end{cases} \qquad (3)$$

The exact average can then be recovered from node $i$'s own iterates as

$$\frac{[x_i(\tau) \;\; x_i(\tau-1) \;\; \cdots \;\; x_i(0)]\, S}{[1 \;\; 1 \;\; \cdots \;\; 1]\, S} \qquad (5)$$

where the vector $S$ is constructed from the coefficients of the minimal polynomial of $W$.
The k-means algorithm partitions the observations into k sets $S_1, \ldots, S_k$ so as to minimize the within-cluster sum of squares (WCSS)

$$\sum_{j=1}^{k} \sum_{x_i \in S_j} \| x_i - c_j \|^2 \qquad (6)$$

with each centroid updated as the mean of its cluster

$$c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i. \qquad (7)$$
a given k, so does the distributed one. Thus, an effective distributed initial centroids choosing method is important. Here
we provide a distributed implementation of the k-means++
algorithm [24], a centralized algorithm to find the initial centroids for the k-means algorithm. It is noted that k-means++
generally outperforms k-means in terms of both accuracy and
speed [24].
Let D(x) denote the distance from observation x to the
closest centroids that have been chosen. The k-means++
algorithm [24] is executed as follows.
1) Choose randomly an observation from {x1 , x2 , . . . , xn } as the first centroid c1 .
2) Take a new center cj , choosing x from the observations with probability $D(x)^2 / \sum_{x' \in \{x_1, x_2, \ldots, x_n\}} D(x')^2$.
3) Repeat step 2) until we have taken k centers altogether.
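The three steps above can be sketched as follows for scalar observations; the function name, the fixed RNG seed, and the sample data are illustrative assumptions, not from [24].

```python
import random

def kmeans_pp_seeds(points, k, rng=random.Random(0)):
    """Pick k initial centroids by D^2 sampling (k-means++ seeding)."""
    centroids = [rng.choice(points)]          # step 1: uniform choice
    while len(centroids) < k:                 # step 3: until k centers
        # D(x)^2: squared distance to the closest centroid chosen so far.
        d2 = [min((x - c) ** 2 for c in centroids) for x in points]
        total = sum(d2)
        # step 2: sample x with probability D(x)^2 / sum_x' D(x')^2.
        r, acc = rng.random() * total, 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(x)
                break
    return centroids

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
seeds = kmeans_pp_seeds(points, 2)
```

Since an already-chosen point has D(x)² = 0, it is never selected again; with two well-separated groups the second seed almost surely comes from the other group, which is the intuition behind the accuracy gain of k-means++.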
Consider that the n nodes are deployed in a Euclidean space of any dimension and suppose that each node has a limited sensing radius, which may differ across nodes, so that the underlying topology of the WSN is directed. Each node is endowed with a real vector $x_i \in \mathbb{R}^d$ representing its observation. Here we assume
that every node has its own unique identification (ID), and
further the underlying topology of the WSN is strongly connected. The detailed realization of the distributed k-means++
algorithm is as shown in Algorithm 1.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 1. Flowcharts for the proposed distributed (a) k-means algorithm and (b) fuzzy c-means algorithm.
it belongs to. During the update step, the finite-time average-consensus algorithm can be utilized to choose the mean of each cluster as the new centroid. Note that some pretreatments, such as weight balancing and normalization [see Fig. 1(a)], need to be done before the distributed algorithm is executed.
The flowchart for the distributed k-means algorithm is shown
in Fig. 1(a) and the concrete realization of the algorithm is
shown in Algorithm 2.
Note that in Algorithm 2,
mirror-imbalance-correcting()
is the mirror imbalance-correcting algorithm in [33] and
distributed-k-means++()
is the distributed k-means++ algorithm proposed in
Section III-B.
Each node i knows its observation $x_i \in \mathbb{R}^d$, n (the number of nodes), k (the number of clusters), and Ni (the set of node i's neighbors). All the nodes execute the distributed k-means algorithm synchronously, and every node obtains the final centroids c and the cluster ki to which it belongs.
The whole algorithm is executed as follows.
1) Weight Balancing: To compute the average of a number of observations by the finite-time average-consensus algorithm, the strongly connected directed graph needs to be weight-balanced. The mirror imbalance-correcting algorithm [33] is executed to assign weights to node i's outgoing edges so as to make the network weight-balanced.
2) Normalization: A min-max standardization method is used to normalize the components of the observations. The max-consensus and min-consensus algorithms [19] are first used to obtain the maximum and minimum values of each component.
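The min-max step of (9) can be sketched directly; the helper name and sample values are illustrative, with the global extremes below standing in for what the max-/min-consensus algorithms deliver to every node.

```python
def normalize(xij, min_j, max_j):
    """Eq. (9): map one observation component into [0, 1] using the
    network-wide extremes delivered by max-/min-consensus."""
    return (xij - min_j) / (max_j - min_j)

vals = [2.0, 5.0, 11.0]            # one component held by three nodes
lo, hi = min(vals), max(vals)       # what min-/max-consensus would return
scaled = [normalize(v, lo, hi) for v in vals]
print(scaled)  # [0.0, 0.3333333333333333, 1.0]
```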
/* Weight balancing */
ai = mirror-imbalance-correcting(Ni );
/* Normalization (componentwise, as in (9)) */
/* Choose initial centroids */
c(1) = distributed-k-means++(xi , n, k, Ni );
T = 1;
while true do
    /* Assignment step */
    ki = argminj ||xi − cj (T)||;
    /* Update step */
    T = T + 1;
    for j = 1; j ≤ k; j++ do
        if ki == j then
            nic = 1;
        else
            nic = 0;
        end
        cj (T) = average-consensus(nic · xi , Ni , ai , n);
        nj = average-consensus(nic , Ni , ai , n);
        cj (T) = cj (T)/nj ;
    end
    c(T) = [c1 (T)ᵀ, . . . , ck (T)ᵀ]ᵀ;
    /* Check convergence */
    if ||c(T) − c(T − 1)|| < ε then
        break;
    end
end
c = c(T);
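A centralized sketch of the assignment/update loop can be written with the same 0/1 indicators n_ic that Algorithm 2 feeds to the average-consensus calls; the function name and toy data are illustrative assumptions.

```python
import numpy as np

def kmeans_indicator(X, C, tol=1e-9):
    """k-means with the centroid update written as a ratio of two
    network-wide means, mean_i(n_ic * x_i) / mean_i(n_ic), mirroring
    the two average-consensus calls of Algorithm 2."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(C, dtype=float).copy()
    while True:
        # Assignment step: index of the nearest centroid per observation.
        labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1),
                           axis=1)
        C_new = C.copy()
        for j in range(len(C)):
            ind = (labels == j).astype(float)     # the indicator n_ic
            if ind.sum() > 0:
                # Update step: ratio of two averages = cluster mean.
                C_new[j] = (ind[:, None] * X).mean(axis=0) / ind.mean()
        if np.linalg.norm(C_new - C) < tol:       # convergence check
            return C_new, labels
        C = C_new

X = [[0, 0], [0, 1], [10, 10], [10, 11]]
C, labels = kmeans_indicator(X, [[0, 0], [9, 9]])
# C -> [[0, 0.5], [10, 10.5]], labels -> [0, 0, 1, 1]
```

The common 1/n in numerator and denominator cancels, so each centroid is exactly the mean of its cluster, as in (7).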
$$\bar{x}_{ij} = \frac{x_{ij} - \min_j}{\max_j - \min_j}. \qquad (9)$$
The fuzzy c-means algorithm minimizes the objective function

$$\sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m} \, \| x_i - c_j \|^2 \qquad (10)$$

with the centers updated as

$$c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m} x_i}{\sum_{i=1}^{n} u_{ij}^{m}}, \quad j = 1, \ldots, k \qquad (11)$$

and

$$u_{ij} = \frac{1}{\sum_{t=1}^{k} \bigl( \| x_i - c_j \| / \| x_i - c_t \| \bigr)^{2/(m-1)}}, \quad i = 1, \ldots, n, \; j = 1, \ldots, k. \qquad (12)$$
Similar to the k-means algorithm, the fuzzy c-means algorithm also uses an iterative refinement technique. Given an
initial set of k centers c1 (1), . . . , ck (1) or an initial membership matrix, the algorithm proceeds by alternating between
updating cj and U according to (11) and (12).
Remark 5: The algorithm minimizes intracluster variance
as well, but has the same problems as the k-means algorithm,
that is, the obtained minimum is a local minimum and the
results depend on the initial choice.
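One alternation of (11) and (12) can be sketched centrally in a few vectorized lines; the function name, the choice m = 2, and the small distance floor (to avoid division by zero) are illustrative assumptions.

```python
import numpy as np

def fcm_step(X, C, m=2.0):
    """One alternating update of fuzzy c-means: memberships from (12),
    then centers from (11). Returns (U, C_new, objective value (10))."""
    # Distances ||x_i - c_j|| with a small floor to avoid division by zero.
    d = np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2),
                   1e-12)
    # (12): u_ij = 1 / sum_t (d_ij / d_it)^(2/(m-1)); rows sum to 1.
    U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))).sum(axis=2)
    # (11): c_j = sum_i u_ij^m x_i / sum_i u_ij^m.
    W = U ** m
    C_new = (W.T @ X) / W.sum(axis=0)[:, None]
    J = (W * d ** 2).sum()                       # objective (10)
    return U, C_new, J

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
U, C_new, J = fcm_step(X, np.array([[0.0, 0.5], [10.0, 10.5]]))
# Each row of U sums to 1; C_new stays near the two cluster means.
```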
B. Distributed Fuzzy c-Means Algorithm
In this section, the proposed distributed fuzzy c-means algorithm is presented in detail.
We make the same assumptions as for the distributed k-means algorithm, i.e., the n nodes deployed in a Euclidean space of any dimension are endowed with real vectors $x_i \in \mathbb{R}^d$, i = 1, . . . , n,
which represent the observations, and the underlying topology
of the WSN is strongly connected.
The goal of the proposed algorithm is to obtain centers
representing different clusters and the membership matrix
describing to what extent a node belongs to a cluster so
as to minimize the objective function described in (10). We
follow the same step as that for the centralized algorithm
except that every step is performed in a decentralized manner. If all the centers can be spread across the network
via the max-consensus algorithm, then each node can calculate the membership matrix directly according to (12).
When computing the new centers, the finite-time averageconsensus algorithm can be utilized. The flowchart for the
distributed fuzzy c-means algorithm is shown in Fig. 1(b)
and the concrete realization of the algorithm is given in
Algorithm 3.
Each node i knows its observation $x_i \in \mathbb{R}^d$, n (the number of nodes in the network), k (the number of clusters), Ni (the set of neighbors of node i), and ε (the threshold used to judge whether the algorithm has converged). All the nodes execute the distributed fuzzy c-means algorithm synchronously, and every node obtains the final centers c and the membership values uij describing to what extent node i belongs to cluster j. The whole algorithm is executed
as follows.
1) Weight Balancing, Normalization, and Choice of the Initial Centers: These steps are the same as in the distributed k-means algorithm.
2) Calculate Membership Matrix: Since the centers have
been provided in step 1), each node can calculate the membership matrix as shown in (12).
3) Calculate New Centers: Each node provides $u_{ij}^m x_i$ and $u_{ij}^m$ to participate in the finite-time average-consensus algorithm, so the resulting averages are $\bigl(\sum_{i=1}^{n} u_{ij}^m x_i\bigr)/n$ and $\bigl(\sum_{i=1}^{n} u_{ij}^m\bigr)/n$. Apparently, the ratio of the two results is the new center according to (11).
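The cancellation of the common 1/n factor claimed here is easy to check numerically; the random scalar data below are illustrative stand-ins, not from the paper.

```python
import numpy as np

# The two network averages computed by finite-time average consensus
# are (1/n) sum_i u_ij^m x_i and (1/n) sum_i u_ij^m; the common 1/n
# cancels in the ratio, which therefore equals the center in (11).
rng = np.random.default_rng(1)
n, m = 5, 2.0
x = rng.random(n)            # scalar observations for simplicity
u = rng.random(n)            # memberships of the n nodes in cluster j
ratio_of_averages = (u ** m * x).mean() / (u ** m).mean()
center_eq11 = (u ** m * x).sum() / (u ** m).sum()
assert np.isclose(ratio_of_averages, center_eq11)
```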
/* Calculate membership matrix */
for j = 1; j ≤ k; j++ do
    uij = 1 / Σₜ₌₁ᵏ ( ||xi − cj (T)|| / ||xi − ct (T)|| )^(2/(m−1));
end
/* Calculate new centers */
T = T + 1;
for j = 1; j ≤ k; j++ do
    cj (T) = average-consensus(uijᵐ · xi , Ni , ai , n);
    nj = average-consensus(uijᵐ , Ni , ai , n);
    cj (T) = cj (T)/nj ;
end
c(T) = [c1 (T)ᵀ, . . . , ck (T)ᵀ]ᵀ;
/* Check convergence */
if ||c(T) − c(T − 1)|| < ε then
    break;
end
end
c = c(T);
V. SIMULATION RESULTS
Now we provide several examples to illustrate the capabilities of the proposed distributed k-means algorithm and
distributed fuzzy c-means algorithm.
Example 1: In this example, we consider a network whose topology is an undirected graph. Here, 40 nodes are randomly placed in the region [0, 1] × [0, 1] and the observation of each node is its position. We need to partition the 40 measurements into four clusters. The network is shown in Fig. 2 and the initial centroids chosen by the distributed k-means algorithm are marked by red crosses.
First, we use the proposed distributed k-means algorithm to deal with these observations, and the clustering result is depicted in Fig. 3, in which different point types denote different clusters while the final centroids are marked by red crosses.
Then the proposed distributed fuzzy c-means algorithm is used to deal with the observations. Considering that soft clustering does not yield a fixed partition, we apply the following treatment so that the clustering result can be observed directly. For node i, if uij ≥ 0.5, then we partition node i into cluster j. If uij < 0.5 for all j, then we call node i a fuzzy point, represented by a plus sign in the simulation results. The simulation result is depicted in Fig. 4. Note that the fuzzy points
Fig. 4. Clustering result with final centroids of the distributed fuzzy c-means
algorithm in the undirected graph.
Fig. 7. Clustering result with final centroids of the distributed fuzzy c-means
algorithm in the directed graph.
from the nine classes, and 400 data points are chosen from each class.
Fig. 8(b) depicts the evolution of the WCSS functions of
the distributed k-means algorithm with k = 6, 9, 12 and of
the centralized k-means algorithm with k = 9. This illustrates
that the convergence of the proposed distributed implementation is similar to the centralized algorithm. The clustering
result of the distributed k-means algorithm with k = 9 is
displayed in Fig. 8(a). Table I compares the performance, including the final value of the WCSS function and the number of iterations before convergence, over 100 Monte Carlo runs for the distributed k-means algorithm (represented by d-k-means), the k-means algorithm with initial centroids chosen by the k-means++ algorithm (represented by k-means++), and the k-means algorithm with initial centroids randomly chosen from the observations (represented by k-means). It can be observed that the distributed k-means algorithm performs best in terms of both speed and accuracy.
Fig. 9(b) depicts the evolution of the objective functions
of the distributed fuzzy c-means algorithm with k = 6, 9, 12
and of the centralized fuzzy c-means algorithm with k = 9.
This illustrates that the convergence of the proposed distributed
implementation is similar to the centralized algorithm. The
clustering result of the distributed fuzzy c-means algorithm
TABLE I
CENTRALIZED VERSUS DISTRIBUTED k-MEANS: PERFORMANCE COMPARISON
TABLE II
CENTRALIZED VERSUS DISTRIBUTED FUZZY c-MEANS: PERFORMANCE COMPARISON
Fig. 10. Evolution of the WCSS functions of the distributed k-means algorithm with k = 10, 25, 50 in the undirected graph with 1024 10-D observations from the cloud dataset.
Fig. 11. Evolution of the objective functions of the distributed fuzzy c-means algorithm with k = 10, 25, 50 in the undirected graph with 1024 10-D observations from the cloud dataset.