International Journal on Communications (IJC) Volume 2 Issue 4, December 2013

www.seipub.org/ijc

An Efficient Clustering Algorithm for Peer-to-Peer Networks


Chun-Liang Lee*1, Tzu-Ming Lin2
Department of Computer Science and Information Engineering, School of Electrical and Computer Engineering,
College of Engineering, Chang Gung University
259 Wen-Hwa 1st Road, Kwei-Shan, Tao-Yuan, Taiwan
*1cllee@mail.cgu.edu.tw; 2m9529022@stmail.cgu.edu.tw

Abstract
Peer-to-peer systems and applications have attracted much
attention because they are more scalable than traditional
client-server ones. To provide efficient communication among
nodes in the network, node clustering can be used to
avoid flooding messages. This paper proposes a distributed node
clustering algorithm that adopts a new way to choose
originators, and evaluates it using the ns-2 simulator.
Experimental results show that the proposed algorithm achieves
better clustering accuracy than existing algorithms on
different types of network topologies. More importantly, it
requires fewer messages for clustering than the
compared algorithms.
Keywords
Clustering Algorithms; Scaled Coverage Measure; Peer-to-Peer
Networks

Introduction
In recent years, the peer-to-peer computing model has
drawn great attention from both researchers and the
general public because of its high scalability compared
to the traditional client-server model. Notable
applications include file sharing (Zhang et al. 2011),
video streaming (Magharei and Rejaie 2009; Ramzan et
al. 2011), and IP telephony (Bonfiglio et al. 2009). One
of the major issues in peer-to-peer computing is how
to provide high scalability with low communication
cost. One possible solution is to divide the
nodes in the network into clusters. Through node
clustering, messages can be processed or merged within a
cluster and then sent to nodes in other clusters.
Since message flooding is avoided, the communication
cost can be reduced significantly.
Node clustering can be done in both centralized and
distributed ways. Since centralized node clustering is
better suited for small networks, this paper focuses on
distributed clustering algorithms.
In the literature, a number of clustering algorithms
have been proposed. The MCL algorithm (Van Dongen
1998) is an efficient and accurate clustering algorithm.
However, it requires global information about the
entire network. The
Connectivity-based Distributed Clustering (CDC)
scheme (Ramaswamy, Gedik, and Liu 2005) is a
distributed clustering algorithm that achieves
clustering accuracy comparable to that of the MCL algorithm.
One major drawback of the CDC scheme is that its
clustering accuracy is not stable because of the way it
chooses originators. The SCM-based Distributed
Clustering (SDC) protocol (Li, Lao, and Cui 2011)
provides better and more stable clustering accuracy
than the CDC scheme. As its name suggests, the
SDC protocol is based on the Scaled Coverage
Measure (SCM), a metric used to evaluate
clustering quality. We give more details about the
SCM later. One drawback of the SDC protocol is
that there are no originators in the network. For some
applications, originators can act as cluster heads that
are in charge of message consolidation and processing.
In this paper, we propose a distributed clustering algorithm
that achieves better clustering
accuracy than the SDC protocol. The proposed
algorithm works similarly to the CDC scheme; that is,
clusters are formed based on originators. We also propose
a new way to find good originators, which
leads to higher clustering accuracy than existing
algorithms.
The rest of this paper is organized as follows.
Section 2 reviews two clustering algorithms that
are closely related to this work. Section 3 presents our
clustering algorithm. Experimental results and
discussion are given in Section 4. Finally, Section 5
concludes this paper and gives some future research
directions.
Related Work
This section briefly reviews two distributed clustering
algorithms that are most related to our work.


Connectivity-based Distributed Clustering Scheme
The CDC scheme (Ramaswamy, Gedik, and Liu 2005)
first selects a set of nodes to be originators, and then
forms a cluster for each originator. A node that is
not an originator may join the cluster of one originator or
remain outside every cluster. Therefore, the clustering
accuracy depends on which nodes are originators. In
order to select good originators, each node
exchanges messages with its neighbors so that it
knows the degree of each of
its neighbors. If a node Vl does not receive a message
indicating that some node has claimed itself an originator, it
waits a random period of time and then calculates a value
called the Two-Hop Probability (THP) as
follows:
$$THP(V_l) = \sum_{V_i \in Nbr(V_l)} \frac{1}{Degree(V_l) \cdot Degree(V_i)},$$
where Nbr(Vl) denotes the set of neighbors of node Vl
and Degree(Vl) denotes the degree of node Vl. If the
value of the THP is larger than a pre-defined threshold,
node Vl claims itself an originator and sends
messages to inform other nodes. Each message
contains the following fields: originator ID,
message ID, weight, and time-to-live (TTL) value. The
initial value of the weight is calculated as follows:
$$weight = \frac{1}{Degree(V_l)}.$$
Once a node Vi receives a message, it accumulates the
value of weight sent by each originator, reduces the
TTL field by 1, and updates the weight as follows:
$$weight \leftarrow \frac{weight}{Degree(V_i)}.$$
If the TTL value is still greater than zero and the weight is not
smaller than a pre-defined threshold, node Vi forwards the
message with the updated fields to its neighbors.
Otherwise, it discards the message.
After waiting a pre-defined period of time, a node
joins the cluster led by the originator with the largest
accumulated weight. However, the way the CDC scheme
selects originators does not guarantee that good
originators are found. We will discuss this in the
next section.
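The message handling described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CDC implementation: the names `build_neighbors`, `cdc_propagate` and `join_clusters`, the example graph (two triangles joined by one edge), the chosen originators, and the parameter values are all hypothetical. The sketch also assumes that a node forwards a message only while its TTL is still positive and its updated weight has not dropped below the threshold, and it omits duplicate suppression based on message IDs.

```python
from collections import defaultdict, deque
from fractions import Fraction


def build_neighbors(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    nbr = defaultdict(set)
    for u, v in edges:
        nbr[u].add(v)
        nbr[v].add(u)
    return dict(nbr)


def cdc_propagate(nbr, originators, ttl, weight_threshold):
    """Flood originator messages; return acc[node][originator] = weight."""
    acc = defaultdict(lambda: defaultdict(Fraction))
    queue = deque()
    for orig in originators:
        weight = Fraction(1, len(nbr[orig]))  # initial weight = 1 / Degree(V_l)
        for n in sorted(nbr[orig]):
            queue.append((n, orig, weight, ttl))
    while queue:
        node, orig, weight, t = queue.popleft()
        acc[node][orig] += weight           # accumulate weight per originator
        t -= 1                              # reduce the TTL by 1
        weight = weight / len(nbr[node])    # weight <- weight / Degree(V_i)
        # Assumption: forward only while the message is still "alive".
        if t > 0 and weight >= weight_threshold:
            for n in sorted(nbr[node]):
                queue.append((n, orig, weight, t))
    return acc


def join_clusters(acc, originators):
    """Each node joins the originator with the largest accumulated weight."""
    clusters = {o: o for o in originators}  # assumption: originators lead their own cluster
    for node, weights in acc.items():
        if node not in originators:
            clusters[node] = max(weights, key=weights.get)
    return clusters


# Hypothetical example: two triangles {0, 1, 2} and {3, 4, 5} joined by edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
NBR = build_neighbors(EDGES)
ACC = cdc_propagate(NBR, originators={0, 5}, ttl=2, weight_threshold=Fraction(1, 100))
CLUSTERS = join_clusters(ACC, originators={0, 5})
print(CLUSTERS)
```

In this example, nodes 1 and 2 join the cluster led by node 0, and nodes 3 and 4 join the cluster led by node 5. Node 2 accumulates weight 3/4 from originator 0 but only 1/6 from originator 5, which illustrates how the per-hop division by the degree makes distant originators less attractive.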
SCM-based Distributed Clustering Protocol
The SDC protocol (Li, Lao, and Cui 2011) takes a
different approach to clustering the network. Each node Vi
initially forms a cluster that includes only itself.
Then, it sends messages to all its neighbors, requesting
each receiving node to estimate and return the gain in
SCM, denoted as $\Delta SCM(V_j)$, where Vj is the receiving
node. If Vi and Vj are in the same cluster, $\Delta SCM(V_j)$
denotes the gain in SCM if node Vi leaves the cluster;
otherwise, $\Delta SCM(V_j)$ denotes the gain in SCM if
node Vi joins the cluster of Vj. After receiving all
response messages, node Vi adds up all received
values. If the result is positive, node Vi leaves its
current cluster and joins another cluster. Otherwise,
node Vi remains in its current cluster.
Proposed Clustering Algorithm
Motivation
For clustering algorithms based on connectivity, the
clustering accuracy is evaluated using the SCM. To
explain the motivation behind the proposed
algorithm, we briefly describe how the SCM is calculated.
First, we define some terms:

• Let G = (V, E) denote a network, where V is the set of nodes and E is the set of edges.
• Let $Cl = \{Cl_1, Cl_2, \ldots, Cl_o\}$ denote a clustering on the network G, which is divided into o clusters.
• Let Nbr(Vi) denote the set of neighbors of node Vi.
• Let Clust(Vi) denote the set of nodes that belong to the same cluster as Vi. That is, $Clust(V_i) = Cl_l$ iff $V_i \in Cl_l$.
• Let FalsePositive(Vi, Cl) denote the set of all nodes that are not neighbors of Vi, but belong to the same cluster as Vi. That is,
$$FalsePositive(V_i, Cl) = \{V_j \mid V_j \in Clust(V_i) \wedge V_j \notin Nbr(V_i)\}.$$
• Let FalseNegative(Vi, Cl) denote the set of nodes that are neighbors of Vi, but do not belong to the same cluster as Vi. That is,
$$FalseNegative(V_i, Cl) = \{V_k \mid V_k \in Nbr(V_i) \wedge V_k \notin Clust(V_i)\}.$$
Given a graph G and a clustering Cl, the SCM of a
node Vi is calculated as follows:
$$SCM(V_i, Cl) = 1 - \frac{|FalsePositive(V_i, Cl)| + |FalseNegative(V_i, Cl)|}{|Nbr(V_i) \cup Clust(V_i)|}.$$
The accuracy of a clustering for graph G is the average
SCM of all nodes. That is,
$$ClustAccuracy(G, Cl) = \frac{\sum_{V_i \in V} SCM(V_i, Cl)}{|V|}.$$
Based on the calculation of the THP and the SCM, we
found that originators selected using the THP do
not always form good clusters. In other words, it is
possible to find another set of originators that leads
to better clustering accuracy. Fig. 1 shows a graph with six
nodes, V0 to V5. According to the CDC scheme, the THP of each
node can be calculated, as shown in Fig. 1. Since the
THP of node V4 is the highest among all nodes,
node V4 is selected as an originator, and a cluster
is formed around it, as indicated by the shaded area. The SCM
of each node in the cluster can be calculated as follows:

SCM(V0, Cl) = 1 - (2/6) = 0.67

SCM(V1, Cl) = 1 - (3/6) = 0.5

SCM(V3, Cl) = 1 - (3/6) = 0.5

SCM(V4, Cl) = 1 - (0/5) = 1

SCM(V5, Cl) = 1 - (3/5) = 0.4

Thus, the clustering accuracy, averaged over these five nodes, is 0.613.


THP(V0) = 1/(4*3) + 1/(4*3) + 1/(4*3) + 1/(4*4) = 0.3125
THP(V1) = 1/(3*4) + 1/(3*4) + 1/(3*3) = 0.2778
THP(V2) = 1/(3*3) + 1/(3*4) + 1/(3*3) = 0.3056
THP(V3) = 1/(3*3) + 1/(3*4) + 1/(3*4) = 0.2778
THP(V4) = 1/(4*3) + 1/(4*4) + 1/(4*3) + 1/(4*1) = 0.4792
THP(V5) = 1/(1*4) = 0.25

FIG. 1 CLUSTER GENERATED BY THE CDC SCHEME
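The numbers in this example can be checked with a short Python sketch. The edge list below is an assumption: it is not given explicitly in the text, but was inferred from the neighbor degrees that appear in the THP terms of Fig. 1. The function names and the choice to form the cluster from the originator plus its direct neighbors are likewise illustrative assumptions, and exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction

# Assumed edge list for the six-node example (inferred, not stated in the text).
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 4), (2, 3), (3, 4), (4, 5)]


def build_neighbors(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    nbr = {}
    for u, v in edges:
        nbr.setdefault(u, set()).add(v)
        nbr.setdefault(v, set()).add(u)
    return nbr


def thp(nbr, l):
    """THP(V_l): sum over neighbors V_i of 1 / (Degree(V_l) * Degree(V_i))."""
    return sum(Fraction(1, len(nbr[l]) * len(nbr[i])) for i in nbr[l])


def scm(nbr, cluster, i):
    """SCM of node i, where `cluster` is the set Clust(V_i) and contains i."""
    false_pos = cluster - nbr[i] - {i}  # same cluster, but not a neighbor
    false_neg = nbr[i] - cluster        # neighbor, but in a different cluster
    return 1 - Fraction(len(false_pos) + len(false_neg), len(nbr[i] | cluster))


NBR = build_neighbors(EDGES)
THP = {v: thp(NBR, v) for v in sorted(NBR)}
ORIGINATOR = max(THP, key=THP.get)            # node with the largest THP
CLUSTER = {ORIGINATOR} | NBR[ORIGINATOR]      # assumption: originator plus its neighbors
SCM = {v: scm(NBR, CLUSTER, v) for v in sorted(CLUSTER)}
ACCURACY = sum(SCM.values()) / len(SCM)       # averaged over the cluster members

for v, value in THP.items():
    print(f"THP(V{v}) = {float(value):.4f}")
print(f"originator = V{ORIGINATOR}, accuracy = {float(ACCURACY):.3f}")
```

Under these assumptions the sketch selects V4 as the originator and yields an accuracy of 46/75, or about 0.613, for the cluster {V0, V1, V3, V4, V5}.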

However, if we choose node V0 as the originator, we
obtain the cluster shown in Fig. 2. The
ETHP values beside each node will be discussed later.
The SCM of each node in the cluster can be calculated as follows:

SCM(V0, Cl) = 1 - (0/5) = 1

SCM(V1, Cl) = 1 - (1/5) = 0.8

SCM(V2, Cl) = 1 - (1/5) = 0.8

SCM(V3, Cl) = 1 - (1/5) = 0.8

SCM(V4, Cl) = 1 - (2/6) = 0.67

Thus, the clustering accuracy is 0.813. Obviously, the
clustering shown in Fig. 2 has higher accuracy than
that in Fig. 1. The problem is how a clustering
algorithm can determine that the best originator is
node V0 rather than node V4.

ETHP(V0) = 1/(4*1) + 1/(4*1) + 1/(4*1) + 1/(4*2) = 0.875
ETHP(V1) = 1/(3*3) + 1/(3*2) + 1/(3*2) = 0.444
ETHP(V2) = 1/(3*2) + 1/(3*2) + 1/(3*2) = 0.5
ETHP(V3) = 1/(3*2) + 1/(3*2) + 1/(3*3) = 0.444
ETHP(V4) = 1/(4*2) + 1/(4*2) + 1/(4*2) + 1/(4*1) = 0.625
ETHP(V5) = 1/(1*4) = 0.25

FIG. 2 CLUSTER GENERATED WHEN NODE V0 IS AN ORIGINATOR

Enhanced Two-Hop Probability
A good originator set should possess the following
two properties (Ramaswamy, Gedik, and Liu 2005):
• The set of originators should be spread out in all regions of the network.
• A node Vl is a good originator if it acquires more weight from the messages initiated by itself than from the messages initiated by any other originator.
In order to select originators with the second property,
the CDC scheme uses the THP to estimate the
probability that a node receives messages initiated by itself
within two hops. However, according to the equation used to
calculate the THP, the smaller the degree of a node Vl
and the degree of each node in Nbr(Vl), the larger
the THP of Vl. This is because if node Vl is selected as
an originator, the cardinalities of both FalsePositive(Vl,
Cl) and FalseNegative(Vl, Cl) are small, and the same holds for each
node in Nbr(Vl), which leads to a large SCM. However,
the calculation of the THP does not consider the
possibility that the high degrees of the neighbors of node
Vl may be caused by edges among these nodes. If node
Vl and its neighbors are in the same cluster, the
neighbors of node Vl do not belong to FalsePositive(Vl,
Cl). Therefore, the high degree of any node in this
cluster does not affect the value of the SCM. Based on the
above analysis, we propose a new way to calculate
the THP, called the Enhanced Two-Hop Probability
(ETHP). The following equation shows how the ETHP
is calculated, where $|Nbr(V_i) \cap Nbr(V_l)|$ denotes the
number of nodes that are neighbors of both node Vi
and node Vl.
$$ETHP(V_l) = \sum_{V_i \in Nbr(V_l)} \frac{1}{Degree(V_l) \cdot [Degree(V_i) - |Nbr(V_i) \cap Nbr(V_l)|]}.$$
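As a hedged illustration of the ETHP equation, the following Python sketch computes the ETHP of every node in the six-node example. The edge list is an assumption inferred from the degrees and shared neighbors that appear in the figure annotations, and the function names are illustrative only.

```python
from fractions import Fraction

# Assumed edge list for the six-node example (inferred, not stated in the text).
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 4), (2, 3), (3, 4), (4, 5)]


def build_neighbors(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    nbr = {}
    for u, v in edges:
        nbr.setdefault(u, set()).add(v)
        nbr.setdefault(v, set()).add(u)
    return nbr


def ethp(nbr, l):
    """ETHP(V_l): like the THP, but each neighbor's degree is reduced by the
    number of neighbors it shares with V_l."""
    total = Fraction(0)
    for i in nbr[l]:
        shared = len(nbr[i] & nbr[l])
        # Degree(V_i) - shared is at least 1: V_l is a neighbor of V_i,
        # but V_l is never a neighbor of itself, so it is not in the intersection.
        total += Fraction(1, len(nbr[l]) * (len(nbr[i]) - shared))
    return total


NBR = build_neighbors(EDGES)
ETHP = {v: ethp(NBR, v) for v in sorted(NBR)}
BEST = max(ETHP, key=ETHP.get)  # node with the largest ETHP

for v, value in ETHP.items():
    print(f"ETHP(V{v}) = {float(value):.3f}")
print(f"selected originator: V{BEST}")
```

With this edge list, V0 obtains the largest value (7/8 = 0.875) and V4 the second largest (5/8 = 0.625), so the ranking of the two candidate originators is reversed relative to the THP.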

The ETHP works similarly to the THP, but the
messages sent by a node Vl have to contain the
neighbor list of Vl. Since the number of neighbors of a
node is generally small (less than 10), the additional
communication cost of the ETHP is negligible.
Let us use an example to explain how the ETHP is
calculated. In Fig. 2, the degree of V1 is 3, but V2 and V4
are neighbors of both V0 and V1. Thus, when V0
calculates its ETHP, V2 and V4 are excluded from the
degree of V1. The term for V1 is 1/(4*(3-2)) = 1/4, rather
than 1/(4*3) = 1/12 as in the THP. In the same way,
we can obtain the ETHP of each node, as shown in Fig.
2. Since V0 has the largest ETHP, it is selected as an
originator. As discussed before, the clustering in Fig. 2 has
a higher clustering accuracy than that in Fig. 1.
Therefore, this example shows that the proposed ETHP
can find good originators.
Weight Calculation
Similar to the CDC scheme, once originators have been
selected, they start to send messages to all their neighbors.
One of the fields in a message is the weight, which is
used by the receiving nodes to determine which
cluster to join. Since the ETHP can find good
originators, the TTL of a message is set to 1. However,
this may leave a node unable to choose the right
cluster to join if two or more of its neighbors are
originators with the same degree. To solve this
problem, we propose a new way to calculate the
weight, as shown in the following equation:
$$weight(V_l) = \frac{1}{Degree(V_l) \cdot [Degree(V_i) - |Nbr(V_i) \cap Nbr(V_l)|]}.$$
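The tie-breaking effect of this weight can be sketched in Python. The sketch assumes that Vl is the originator that sends the message and Vi is the node that receives it, which the text does not state explicitly. The edge list is the one inferred for the six-node example, and treating both V0 and V4 as originators is a purely hypothetical scenario chosen because both have degree 4. The names `ethp_weight` and `choose_cluster` are illustrative.

```python
from fractions import Fraction

# Assumed edge list for the six-node example (inferred, not stated in the text).
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 4), (2, 3), (3, 4), (4, 5)]


def build_neighbors(edges):
    """Return a dict mapping each node to the set of its neighbors."""
    nbr = {}
    for u, v in edges:
        nbr.setdefault(u, set()).add(v)
        nbr.setdefault(v, set()).add(u)
    return nbr


def ethp_weight(nbr, originator, receiver):
    """Weight of a message from `originator` (V_l) as seen by `receiver` (V_i)."""
    shared = len(nbr[receiver] & nbr[originator])
    return Fraction(1, len(nbr[originator]) * (len(nbr[receiver]) - shared))


def choose_cluster(nbr, node, originators):
    """With TTL = 1, a node only hears from originators that are direct neighbors."""
    candidates = [o for o in sorted(originators) if o in nbr[node]]
    if not candidates:
        return None  # no originator within one hop
    return max(candidates, key=lambda o: ethp_weight(nbr, o, node))


NBR = build_neighbors(EDGES)
ORIGINATORS = {0, 4}  # hypothetical: both degree-4 nodes act as originators

# The CDC-style weight 1 / Degree(V_l) would be 1/4 for both originators (a tie).
print(ethp_weight(NBR, 0, 1), ethp_weight(NBR, 4, 1))
print({v: choose_cluster(NBR, v, ORIGINATORS) for v in (1, 2, 3, 5)})
```

In this scenario node V1 receives weight 1/4 from V0 but only 1/8 from V4, because it shares two neighbors with V0 and only one with V4, so the tie between the two equal-degree originators is resolved in favor of V0.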

FIG. 3 CLUSTERING ACCURACY ON RANDOM TOPOLOGIES

Experiments and Results


Experiment Settings
We use the ns-2 simulator (Ns2 2013) to evaluate the
performance of the proposed clustering algorithm.
Since the SDC protocol outperforms the CDC scheme
in both clustering accuracy and message overhead, we
compare our algorithm only with the SDC protocol.
Table 1 lists some important parameters of the SDC
protocol and the proposed algorithm. In order to study
the impact of network structure on the clustering
algorithms, we use two different types of topologies:
random topologies and power-law topologies. The number
of nodes ranges from 200 to 1000. To reduce the
experimental variance, each experiment is performed
50 times, and the average values are
reported.

FIG. 4 CLUSTERING ACCURACY ON POWER-LAW TOPOLOGIES

TABLE 1 PARAMETER VALUES

Algorithm            Parameter          Value
SDC                  TTL                3
Proposed Algorithm   ETHP Threshold     0.0005
Proposed Algorithm   Weight Threshold   0.0001
Proposed Algorithm   TTL                1

FIG. 5 MESSAGE OVERHEAD ON RANDOM TOPOLOGIES

Experimental Results
Figs. 3 and 4 show the clustering accuracy of the SDC
protocol and the proposed algorithm (indicated by
ETHP) on random and power-law topologies,
respectively. It can be seen that the proposed
algorithm achieves better clustering accuracy than
the SDC protocol for both types of topologies. This is
because the proposed ETHP can find better originators,
thus improving the clustering accuracy.
FIG. 6 MESSAGE OVERHEAD ON POWER-LAW TOPOLOGIES

A good clustering algorithm has to not only achieve
good clustering accuracy, but also avoid generating
too many messages. Figs. 5 and 6 show the number of
messages generated by the SDC protocol and the
proposed algorithm on random and power-law
topologies, respectively. With the proposed clustering
algorithm, a node only needs to send the list of its
neighbors to other nodes so that good originators can
be selected. After that, other nodes can choose a
cluster to join by sending weight packets with TTL=1.
Therefore, the number of messages can be reduced
significantly. As can be seen from Figs. 5 and 6, the
proposed algorithm generates fewer messages than the
SDC protocol.

Conclusions
In this paper, we have proposed a distributed clustering
algorithm that yields better clusters than
existing algorithms. In order to select good originators,
we analysed the weakness of the THP used in the CDC
scheme and proposed the ETHP. Our experiments
indicate that the proposed algorithm not only
achieves high clustering accuracy, but also generates fewer
messages than the SDC protocol. Although this paper
focuses on static networks, the proposed algorithm
can handle node dynamics in the same way as the
CDC scheme. It has been shown that the CDC scheme
needs to recluster the whole network periodically to
avoid degradation in clustering accuracy. Thus, one of
our future research directions is to enhance the
proposed clustering algorithm to deal with node
dynamics more efficiently.

ACKNOWLEDGMENT

This work was supported in part by the High Speed
Intelligent Communication (HSIC) research center,
Chang Gung University, Taiwan, and by grants from
Chang Gung University (UERPD270291) and the
National Science Council of Taiwan (NSC 98-2221-E-182-033 and NSC 102-2221-E-182-034).

REFERENCES

Androutsellis-Theotokis, Stephanos, and Diomidis Spinellis. "A Survey of Peer-to-Peer Content Distribution Technologies." ACM Computing Surveys 36 (2004): 335-371.
Bonfiglio, Dario, Marco Mellia, Michela Meo, and Dario Rossi. "Detailed Analysis of Skype Traffic." IEEE Transactions on Multimedia 11 (2009): 117-127.
Li, Yan, Li Lao, and Jun-Hong Cui. "SDC: A Distributed Clustering Protocol." International Journal of Computer Networks 2 (2011): 205-226.
Magharei, Nazanin, and Reza Rejaie. "PRIME: Peer-to-Peer Receiver-Driven Mesh-Based Streaming." IEEE/ACM Transactions on Networking 17 (2009): 1052-1065.
Ns2. "The Network Simulator ns-2." Accessed November 27, 2013. http://www.isi.edu/nsnam/ns/.
Ramaswamy, Lakshmish, Bugra Gedik, and Ling Liu. "A Distributed Approach to Node Clustering in Decentralized Peer-to-Peer Networks." IEEE Transactions on Parallel and Distributed Systems 16 (2005): 814-829.
Ramzan, Naeem et al. "Peer-to-Peer Streaming of Scalable Video in Future Internet Applications." IEEE Communications Magazine 49 (2011): 128-135.
Van Dongen, Stijn. "A New Cluster Algorithm for Graphs." Technical Report INS-R9814. Centrum voor Wiskunde en Informatica (CWI), Amsterdam, The Netherlands, 1998.
Van Dongen, Stijn. "Performance Criteria for Graph Clustering and Markov Cluster Experiments." Technical Report INS-R0012. Centrum voor Wiskunde en Informatica (CWI), Amsterdam, The Netherlands, 2000.
Zhang, Chao, Prithula Dhungel, Di Wu, and Keith W. Ross. "Unraveling the BitTorrent Ecosystem." IEEE Transactions on Parallel and Distributed Systems 22 (2011): 1164-1177.

Chun-Liang Lee received M.S. and Ph.D. degrees in
computer science and information engineering from
National Chiao Tung University, Hsinchu, Taiwan in 1997
and 2001, respectively. From 2002 to 2006, he worked with
the Telecommunication Laboratories, Chunghwa Telecom
Co., Ltd. Since February 2006, he has been an assistant
professor of Computer Science and Information Engineering
at Chang Gung University, Taoyuan, Taiwan. His research
interests include the design and analysis of network
protocols, quality of service in the Internet, and packet
classification algorithms.
Tzu-Ming Lin received his M.S. degree in computer science
and information engineering from Chang Gung University,
Taoyuan, Taiwan in 2008. His research interests are in
distributed computing systems.
