Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai 200093, China
Fudan University, 825 Zhangheng Road, Shanghai 201203, China
Article info
Article history:
Received 17 October 2014
Received in revised form 14 March 2015
Accepted 29 April 2015
Communicated by Liang Lin
Available online 15 May 2015
Abstract
Computing similarity between two nodes in directed graphs plays an increasingly important role in
various research fields, including clustering, collaborative filtering and community mining. Many
similarity measures have been proposed in recent years, such as SimRank, PSimRank and SimFusion.
However, these measures consider only the expected meeting probability over paths of equal length, which
may omit some latent similar nodes. Besides, the importance of each link is not distinguished,
which may lead to unreasonable rankings when searching for similar nodes. In this paper, we propose an
effective structural-based similarity measure, ESimRank, for computing similarities in
directed graphs. We first define effective relationship strength (ERS) to distinguish link importance
by utilizing node activity, node attraction and link frequency. We then formalize the ESimRank equation
by combining ERS with the expected meeting probabilities over paths of any length. Compared to existing
similarity measures, ESimRank can find more latent similar nodes and produce rankings of better quality. To
support fast similarity computation, we develop an extended partial sums-based algorithm, which
reduces the time complexity significantly. Extensive experiments demonstrate the effectiveness and
efficiency of ESimRank in comparison with state-of-the-art similarity measures.
© 2015 Elsevier B.V. All rights reserved.
Keywords:
Similarity
Effective relationship strength
ESimRank
1. Introduction
Many real networks can be represented as directed graphs: for
example, an e-mail network, where a node represents an e-mail user and
an edge represents the delivery of an e-mail
from one user to another; a citation network, where a node represents a
paper and an edge represents a citation between papers; and a
web network, where a node represents a web page and an edge
represents a hyperlink from one web page to another. As these
networks become massive and diverse, discovering valuable knowledge from directed graphs has become a
significant task. Similarity computation between nodes in directed
graphs [2,9,12,14,31,34,38] is one important aspect of network analysis,
and is required by many real applications to evaluate the underlying
similarity between nodes, including clustering [1,33,35], community
mining [18,21] and recommendation systems [5,11,27]. To meet
this requirement, an effective similarity evaluation function is
required for answering the question "How similar are these two
* Corresponding author at: College of Communication and Art Design, University
of Shanghai for Science and Technology, 516 Jungong Road, Shanghai 200093, China.
E-mail addresses: mingxizhang10@fudan.edu.cn (M. Zhang),
huhao@fudan.edu.cn (H. Hu), zhenying@fudan.edu.cn (Z. He),
lipinggao@fudan.edu.cn (L. Gao), liujiesunx@163.com (L. Sun).
http://dx.doi.org/10.1016/j.neucom.2015.04.084
0925-2312/© 2015 Elsevier B.V. All rights reserved.
2. Related work
Due to the practical significance of similarity computation in
directed graphs, many approaches have been proposed in recent
Fig. 1. Example of directed graphs: two fractions of citation network. (a) Returning result for given query node p2 (graph over nodes p1 to p7): SimRank returns (p6, 0.640) at rank 1 and N/A at ranks 2 to 6. (b) Returning result for given query node p14 (graph over nodes p8 to p15): SimRank returns (p11, 0.400), (p12, 0.400), (p13, 0.400) and (p15, 0.200) at ranks 1 to 4 and N/A at ranks 5 and 6.
a-v0 -v1 -xb and arrive at common node v0 , while this case is
considered by ESimRank.
Lin et al. [23] proposed a model called the Extended Neighborhood Structure (ENS), which utilizes both in- and out-relationships
to extend existing similarity measures. Most
existing similarity measures can be extended with ENS, such as SimRank,
SimFusion and our proposed ESimRank. ENS is a general framework for similarity computation, so its effectiveness depends
on the similarity measure being extended. Although ESimRank also considers in- and out-links for similarity computation, it
differs from ENS, since these two factors are exploited for
defining ERS, which is used to distinguish link importance.
ESimRank does not conflict with ENS in practice, since ENS can be
easily integrated into ESimRank to further improve the performance of similarity computation.
3. ESimRank
As mentioned above, existing similarity measures consider
only the expected meeting probability over paths of equal length and do
not give enough attention to link importance, which may decrease
the effectiveness of the returned results. To overcome these two
drawbacks, in this section we propose a comprehensive similarity
measure, ESimRank, which utilizes ERS to distinguish link importance and considers meetings over paths of any length.
3.1. Preliminaries
Before discussing similarity computation further, we first
give the definition of a directed graph for the subsequent
discussion.
Definition 1 (Directed graph). A directed graph is denoted by
$G(V, E, W)$, where $V$ is a set of nodes, $E$ is a set of links and $W$ is
a set of weights. Edge $e(u,v) \in E$ represents the relationship from $u$
to $v$, where $u, v \in V$ are nodes of graph $G$, and the weight of
edge $e(u,v)$ is denoted by $w(u,v) \in W$.
For a node $v$ in the graph $G$, the sets of its in-neighbors and out-neighbors are defined as $I(v) = \{q \mid q \in V \wedge e(q,v) \in E\}$ and $O(v) =
\{p \mid p \in V \wedge e(v,p) \in E\}$, respectively. The number of in-neighbors is
$|I(v)| \le n$ and the number of out-neighbors is $|O(v)| \le n$, where $n$ is
the number of nodes in graph $G$.
In this paper, the initial weight $w(u,v)$ is defined as the link
frequency of $e(u,v)$, i.e., the number of times the connection
between $u$ and $v$ associated with edge $e(u,v)$ occurs. For example, in an e-mail network, the weight of edge $e(u,v)$ is the number of deliveries
from user $u$ to user $v$; and in a web network, the weight of edge
$e(u,v)$ is the number of hyperlinks to web page $v$ contained in
web page $u$.
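As a minimal sketch of these preliminaries, the following Python snippet aggregates raw connection events into link-frequency weights and builds the in- and out-neighbor sets. The function name `build_graph`, the tuple-based event format and the sample e-mail log are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict


def build_graph(events):
    """Aggregate (u, v) connection events into link-frequency weights.

    Returns (w, out_nbrs, in_nbrs): w[(u, v)] counts how often the
    connection u -> v occurred, out_nbrs[u] plays the role of O(u) and
    in_nbrs[v] plays the role of I(v).
    """
    w = defaultdict(int)
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for u, v in events:
        w[(u, v)] += 1          # link frequency of edge e(u, v)
        out_nbrs[u].add(v)      # v is an out-neighbor of u
        in_nbrs[v].add(u)       # u is an in-neighbor of v
    return dict(w), dict(out_nbrs), dict(in_nbrs)


# Hypothetical e-mail log: each pair is one delivery from sender to receiver.
events = [("alice", "bob"), ("alice", "bob"), ("alice", "carol"), ("dave", "bob")]
w, O, I = build_graph(events)
```

Here the two deliveries from `alice` to `bob` collapse into a single edge with weight 2, which is the initial weight the text describes.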
3.2. Modeling ERS
Roughly, link importance could be defined as the link frequency
between two nodes. However, this may be unreasonable, since the
ERS of two edges may be very different even though their link
frequencies are the same. For example, in a citation network, the
citation relationship between a paper on "graph mining" and a
paper on "similarity" would be weaker than the citation relationship between two papers on the same topic "similarity", even
though the link frequency of each citation is 1. To distinguish
link importance, in this subsection we define ERS, a
measure of "how strong is the connection?". In the following,
three factors are discussed intuitively for modeling ERS, namely
node activity, node attraction and link frequency, based on which
we propose the ERS computation formula.
3
case, the complexity is O n, which is also denoted by On2 since
is a constant. Many scale-free networks, such as e-mail networks
ERSa; p
dx A a ERSa; x
$$\mathrm{ERS}(u,v) = \frac{w(u,v)^2}{\sum_{p \in O(u)} w(u,p) \cdot \sum_{q \in I(v)} w(q,v)}$$
By this formula, the ERS of each edge can be computed, and
the importance of each link can then be distinguished. After
the ERS of each edge is obtained, the weight of each edge is
replaced with its ERS, i.e., $w(u,v) = \mathrm{ERS}(u,v)$ for all $e(u,v) \in E$.
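The ERS computation can be sketched in a few lines of Python, assuming ERS(u, v) is the squared link frequency of e(u, v) divided by the product of the total outgoing weight of u and the total incoming weight of v. The function name `compute_ers` and the dictionary-based edge representation are assumptions made for illustration; the sketch also precomputes the two sums once per node, which is a design choice of this example and not necessarily how the authors implemented it.

```python
def compute_ers(w):
    """Return ERS for every edge, given link frequencies w[(u, v)].

    Assumed form: ERS(u, v) = w(u, v)^2 / (sum of w(u, p) over out-neighbors
    p of u, times sum of w(q, v) over in-neighbors q of v).
    """
    out_sum = {}  # total outgoing link frequency per source node
    in_sum = {}   # total incoming link frequency per target node
    for (u, v), f in w.items():
        out_sum[u] = out_sum.get(u, 0) + f
        in_sum[v] = in_sum.get(v, 0) + f
    return {
        (u, v): f * f / (out_sum[u] * in_sum[v])
        for (u, v), f in w.items()
    }


# Hypothetical toy graph: a sends twice to b and once to c; d sends once to b.
w = {("a", "b"): 2, ("a", "c"): 1, ("d", "b"): 1}
ers = compute_ers(w)
# After this step, the edge weights would be replaced by the ERS values.
```

In this toy graph, edge (a, b) gets ERS 4 / (3 * 3), while (a, c) and (d, b) each get 1 / 3, so edges with equal link frequency need not receive equal ERS in general.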
For a given graph with $m$ edges and $n$ nodes, we can easily
derive that the time cost for computing ERS is $O(md^2)$, where $d$ is
the average degree (in-degree or out-degree). Many networks
generally have a sparse degree distribution, which allows us to
derive the time complexity in the average case. One type of
network is the random graph studied in [7]; the degree of such a
graph has a Poisson distribution, i.e., the probability that a node has
degree $k$ is $p_k = \frac{\lambda^k}{k!} e^{-\lambda}$, and then we have $E[k] = \lambda$. In this
p A aq A b
where m =C l m l.
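3: for $a \in V$ do
4:   for $b \in V$ do
5:     if $a = b$ then
6:       $R_l(a,b) \leftarrow 1$;
7:     else
8:       $R_l(a,b) \leftarrow C \sum_{p} \sum_{q} P(a,p)\,P(b,q)\,R_{l-1}(p,q)$, where $p$ and $q$ range over the neighbor sets of $a$ and $b$;
9:     end if
10:   end for
11: end for
12: end for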
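The naive iteration above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the transition probabilities P(a, p) are passed in as a nested dictionary `P[a][p]` (so the sketch does not commit to a particular neighbor set), the initial scores are taken to be 1 for identical nodes and 0 otherwise as in SimRank, and the names `iterate_similarity`, `C=0.8` and `iterations=5` are hypothetical defaults.

```python
def iterate_similarity(nodes, P, C=0.8, iterations=5):
    """Naive pairwise iteration R_l(a,b) = C * sum_p sum_q P(a,p) P(b,q) R_{l-1}(p,q).

    P[a] maps each neighbor p of a to the transition probability P(a, p).
    Assumption: R_0 is the identity (1 on the diagonal, 0 elsewhere).
    """
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        nxt = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    nxt[(a, b)] = 1.0  # a node is maximally similar to itself
                else:
                    # Double sum over the neighbors of a and the neighbors of b.
                    nxt[(a, b)] = C * sum(
                        pa * pb * R[(p, q)]
                        for p, pa in P.get(a, {}).items()
                        for q, pb in P.get(b, {}).items()
                    )
        R = nxt
    return R


# Hypothetical toy input: x and y each have z as their only neighbor.
nodes = ["x", "y", "z"]
P = {"x": {"z": 1.0}, "y": {"z": 1.0}}
R = iterate_similarity(nodes, P)
```

Because both walkers reach `z` in one step with probability 1, the score of the pair (x, y) equals the decay factor C. The double sum per node pair is what makes this version expensive on graphs with high node degree.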
4. Optimized ESimRank
6:
7:
8:
3:
4:
l 1 =C L l 1 L;
For a A V do
l 1
5:
l1
T, essential, epartiala
n0;
For p A a do
Rl l 11 a; b to denote the similarity between a and b with considering l 1 that is treated as a threshold at iteration l 1. And then
Rl l 11 a; b C
l
Pb; qepartiala
q
p A b
l
q
epartiala
X
p A a
Pa; pRl l p; q
Theorem 2. For a; b A V, parameter C A 0; 1, iteration l, and threshold , the following estimations hold
1. Rl a; b Rl l a; b r
l
P
C l m m
l
P
C l m m
m1
l1
2. sima; b Rl l a; b r C
m1
9:
l 1
l1
l1
epartialOa
qepartiala
q Pa; pRl l 11 p; q;
10:
11:
12:
13:
14:
15
16:
17:
18:
end for
end for
For q A T do
for b A Iq [ fqg do
essentialessential [ fbg;
end for
end for
For bA essential do
if a b then
19:
20:
21:
Rl l a; b1, continue;
end if
For q A b do
22:
l 1
l1
Rl l a; bRl l a; b Pb; qepartiala
q;
23:
end for
24:
end for
25: end for
26: end for
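The partial-sums idea that the optimized algorithm builds on can be illustrated with a short Python sketch. It assumes the recurrence has the form R(a, b) = C * sum over q of P(b, q) * partial_a(q), with partial_a(q) = sum over p of P(a, p) * R(p, q), and it deliberately omits the threshold-based pruning and the essential-node selection of the listing above. The function name `iterate_with_partial_sums`, the nested-dictionary format of `P` and the default parameters are hypothetical.

```python
def iterate_with_partial_sums(nodes, P, C=0.8, iterations=5):
    """Pairwise iteration that reuses one partial sum per source node a.

    partial[q] = sum_p P(a, p) * R(p, q) is computed once for each a and
    then shared by every b, instead of being recomputed per pair (a, b).
    Assumption: no thresholding; R_0 is the identity.
    """
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        nxt = {}
        for a in nodes:
            nbrs_a = P.get(a, {})
            # Memoized inner sum, computed once per a.
            partial = {
                q: sum(pa * R[(p, q)] for p, pa in nbrs_a.items())
                for q in nodes
            }
            for b in nodes:
                if a == b:
                    nxt[(a, b)] = 1.0
                else:
                    # Outer sum only touches the neighbors of b.
                    nxt[(a, b)] = C * sum(
                        pb * partial[q] for q, pb in P.get(b, {}).items()
                    )
        R = nxt
    return R


# Hypothetical toy input: x moves to z or u with equal probability, y moves to z.
nodes = ["x", "y", "z", "u"]
P = {"x": {"z": 0.5, "u": 0.5}, "y": {"z": 1.0}}
R = iterate_with_partial_sums(nodes, P)
```

The result is the same as the naive double sum, but each pair now costs a single pass over the neighbors of `b`, which is why the saving grows with node degree.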
5. Experimental study
All our experiments are performed on an Intel(R) Xeon (R) CPU
2.27 GHz, 8 GB memory, running Windows Server 2008. All algorithms were implemented in C and compiled using VS 2010.
5.1. Setup
5.1.1. Datasets
We ran our experiments on the following three real datasets:
Q
1X
v
Q i 1 A;k i
where Q is the set of queries chosen from the graph and a is the
similarity measure. A;k vi is the average precision of the returned
1 http://snap.stanford.edu/data/cit-HepPh.html
2 http://www.sfu.ca/
3 http://www.cs.cmu.edu/~enron/
A;k vi
l
1X
v ; v
k j 1 A;k i j
Fig. 2. Distribution of node degree on each dataset. (a) Citation network, (b) web network, (c) e-mail network.
Fig. 3. MAP score change in citation network. (a) MAP score on varying l, (b) MAP score on varying k.
Fig. 4. MAP score change in web network. (a) MAP score on varying l, (b) MAP score on varying k.
5.3. Efficiency
5.3.1. Running time of similarity computation
Table 2 shows the running time of similarity computation. On
each dataset, PSimRank is the most time-consuming, since it
involves set operations while computing similarity.
ESimRank needs more time than the other methods except PSimRank,
because it is more comprehensive and naturally requires
more operations for computing the meeting probabilities over paths of any
length. Opt-ESimRank, however, is significantly more efficient than the naive ESimRank: its running time is only
4.06% of that of ESimRank on the citation network, 52.70% on the web
network and 9.68% on the e-mail network. The improvement on the web network is less evident because the node
degree of the web network is lower than that of the other networks, and, as analyzed, the
improvement from the optimization technique is more evident on graphs with higher node degree.
Although Opt-ESimRank requires more time than Opt-SimRank, its
time cost is still significantly lower than that of the other methods. From this
result, we conclude that the time cost of our proposed approach is
comparable to existing similarity measures.
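The percentages quoted above can be checked against the running times reported in Table 2 with a short calculation. The dictionary names below are illustrative; the numbers are the ESimRank and Opt-ESimRank running times (in minutes) from Table 2.

```python
# Running times in minutes, taken from Table 2.
naive = {"citation": 68.76, "web": 21.54, "e-mail": 36.37}      # ESimRank
optimized = {"citation": 2.79, "web": 11.35, "e-mail": 3.52}    # Opt-ESimRank

# Opt-ESimRank time as a percentage of the naive ESimRank time.
ratio = {name: 100 * optimized[name] / naive[name] for name in naive}

for name, pct in ratio.items():
    print(f"{name}: {pct:.2f}%")
```

With the rounded table values, the ratios come out at roughly 4.06%, 52.69% and 9.68%; the small difference from the quoted 52.70% for the web network is presumably due to the table entries being rounded.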
Table 1
MAP score comparison of different factors.
Dataset
Link frequency
Node attraction
Node activity
ERS
Citation network
Web network
E-mail network
0.0452147
0.0103911
0.0107991
0.0452147
0.0103911
0.0108843
0.0464276
0.0133236
0.0150150
0.0452446
0.0104690
0.0146060
0.0470685
0.0143902
0.0148843
Fig. 5. MAP score change in e-mail network. (a) MAP score on varying l, (b) MAP score on varying k.
Table 2
Running time of similarity computation (min).
Dataset            ESimRank   Opt-ESimRank   SimRank   Opt-SimRank   PSimRank   SimFusion   SimRank*
Citation network   68.76      2.79           32.11     1.90          238.53     33.02       29.09
Web network        21.54      11.35          12.98     3.58          153.71     13.51       14.90
E-mail network     36.37      3.52           16.04     0.31          205.75     23.63       13.23
Table 3
Similarity matrix size (K).
Dataset            ESimRank    Opt-ESimRank   SimRank     Opt-SimRank   PSimRank    SimFusion   SimRank*
Citation network   9525.65     2368.59        5848.58     3377.65       5848.58     4380.14     9502.77
Web network        20,827.63   2112.66        19,375.96   19,248.70     19,375.96   19,372.48   20,802.60
E-mail network     6775.18     463.92         152.71      150.82        152.71      144.00      6775.18
Table 4
MAP score comparison by exploiting in- and out-relationships.
Dataset            ENS-ESimRank   ENS-SimRank   P-Rank
Citation network   0.0860502      0.0690213     0.0634202
Web network        0.0206286      0.0131910     0.0058847
E-mail network     0.0152735      0.0141921     0.0082692
Acknowledgments
This work was supported by National Science Foundation of
China Grants 61170007, 61170006 and 61202376; Bidding Program
of Shanghai Research Institute of Publishing and Media Grants
SAYB1410; Young University Teacher Funding Program of Shanghai
Municipal Education Commission Grants ZZSLG14021; Innovation
Program of Shanghai Municipal Education Commission Grants
15ZZ074, 13YZ075 and 13ZZ111; and National High-tech R&D
Program of China (863 Program) Grants SS2015AA011809.
References
[1] H. Abusaimeh, M. Shkoukani, F. Alshrouf, Balancing the network clusters for the lifetime enhancement in dense wireless sensor networks, Arab. J. Sci. Eng. 39 (5) (2014) 3771–3779.
[2] R. Amsler, Application of citation-based automatic classification, Technical Report, The University of Texas at Austin Linguistics Research Center, December 1972.
[3] Y. Cai, S. Chakravarthy, Pairwise similarity calculation of information networks, in: Proceedings of Data Warehousing and Knowledge Discovery – 13th International Conference, DaWaK 2011, Toulouse, France, August 29–September 2, 2011, pp. 316–329.
[4] Y. Cai, M. Zhang, C. Ding, S. Chakravarthy, Closed form solutions of similarity algorithms, in: SIGIR, 2010, pp. 709–710.
[5] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'El, I. Ronen, E. Uziel, S. Yogev, S. Chernov, Personalized social search based on the user's social network, in: CIKM, 2009, pp. 1227–1236.
[6] A. Clauset, C.R. Shalizi, M.E.J. Newman, Power-law distributions in empirical data, SIAM Rev. 51 (4) (2009) 661–703.
[7] P. Erdős, A. Rényi, On random graphs, Publ. Math. Debr. 6 (1959) 290–297.
[8] M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the internet topology, in: SIGCOMM, 1999, pp. 251–262.
[9] D. Fogaras, B. Rácz, Scaling link-based similarity search, in: WWW, 2005, pp. 641–650.
[10] J. Gehrke, P. Ginsparg, J.M. Kleinberg, Overview of the 2003 kdd cup, SIGKDD Explor. 5 (2) (2003) 149–151.
[11] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel, S. Yogev, S. Ofek-Koif, Personalized recommendation of social software items based on social relations, in: RecSys, 2009.
[12] G. Jeh, J. Widom, Simrank: a measure of structural-context similarity, in: KDD, 2002, pp. 538–543.
[13] G. Jeh, J. Widom, Scaling personalized web search, in: WWW, 2003, pp. 271–279.
[14] M.M. Kessler, Bibliographic coupling between scientific papers, Am. Doc. 14 (1963) 10–25.
[15] K. Khan, B.B. Baharudin, A. Khan, Semantic-based unsupervised hybrid technique for opinion targets extraction from unstructured reviews, Arab. J. Sci. Eng. 39 (5) (2014) 3681–3689.
[16] P. Lee, L.V.S. Lakshmanan, J.X. Yu, On top-k structural similarity search, in: ICDE, 2012, pp. 774–785.
[17] J. Leskovec, J.M. Kleinberg, C. Faloutsos, Graphs over time: densification laws, shrinking diameters and possible explanations, in: KDD, 2005, pp. 177–187.
[18] J. Leskovec, K.J. Lang, A. Dasgupta, M.W. Mahoney, Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, 2008, CoRR abs/0810.1355.
[19] C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, T. Wu, Fast computation of simrank for static and dynamic information networks, in: EDBT, 2010, pp. 465–476.
[20] P. Li, Y. Cai, H. Liu, J. He, X. Du, Exploiting the block structure of link graph for efficient similarity computation, in: PAKDD, 2009, pp. 389–400.
[21] W. Lin, X. Kong, P.S. Yu, Q. Wu, Y. Jia, C. Li, Community detection in incomplete information networks, in: WWW, 2012, pp. 341–350.
[22] Z. Lin, I. King, M.R. Lyu, Pagesim: a novel link-based similarity measure for the world wide web, in: WI, 2012, pp. 687–693.
[23] Z. Lin, M.R. Lyu, I. King, Extending link-based algorithms for similar web pages with neighborhood structure, in: 2007 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2007, 2–5 November 2007, Silicon Valley, CA, USA, Main Conference Proceedings, 2007, pp. 263–266.
[24] Z. Lin, M.R. Lyu, I. King, Matchsim: a novel neighbor-based similarity measure with maximum neighborhood matching, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2–6, 2009, pp. 1613–1616.
[25] D. Lizorkin, P. Velikhov, M.N. Grinev, D. Turdakov, Accuracy estimate and optimization techniques for simrank computation, Proc. VLDB Endow. 1 (1) (2008) 422–433.
[26] D. Lizorkin, P. Velikhov, M.N. Grinev, D. Turdakov, Accuracy estimate and optimization techniques for simrank computation, VLDB J. 19 (1) (2010) 45–66.
[27] M. Moricz, Y. Dosbayev, M. Berlyant, Pymk: friend recommendation at myspace, in: SIGMOD, 2010, pp. 999–1002.
[28] J.-Y. Pan, H.-J. Yang, C. Faloutsos, P. Duygulu, Automatic multimedia cross-modal correlation discovery, in: KDD, 2004, pp. 653–658.
[29] D. Sánchez, M. Batet, A semantic similarity method based on information content exploiting multiple ontologies, Expert Syst. Appl. 40 (4) (2013) 1393–1399.
[30] D. Sánchez, M. Batet, D. Isern, A. Valls, Ontology-based semantic similarity: a new feature-based approach, Expert Syst. Appl. 39 (9) (2012) 7718–7728.
[31] H.G. Small, Co-citation in the scientific literature: a new measure of the relationship between two documents, J. Am. Soc. Inf. Sci. 24 (4) (1973) 265–269.
[32] Y. Sun, J. Han, X. Yan, P.S. Yu, T. Wu, Pathsim: meta path-based top-k similarity search in heterogeneous information networks, PVLDB 4 (11) (2011) 992–1003.
[33] Y. Sun, Y. Yu, J. Han, Ranking-based clustering of heterogeneous information networks with star network schema, in: KDD, 2009, pp. 797–806.
[34] W. Xi, E.A. Fox, W. Fan, B. Zhang, Z. Chen, J. Yan, D. Zhuang, Simfusion: measuring similarity using unified relationship matrix, in: SIGIR, 2005, pp. 130–137.
[35] X. Yin, J. Han, P.S. Yu, Linkclus: efficient clustering via heterogeneous semantic links, in: VLDB, 2006, pp. 427–438.
[36] W. Yu, X. Lin, Irwr: incremental random walk with restart, in: SIGIR, 2013, pp. 1017–1020.
[37] W. Yu, X. Lin, W. Zhang, L. Chang, J. Pei, More is simpler: effectively and efficiently assessing node-pair similarities based on hyperlinks, PVLDB 7 (1) (2013) 13–24.
[38] M. Zhang, Z. He, H. Hu, W. Wang, E-rank: a structural-based similarity measure in social networks, in: WI, 2012, pp. 415–422.
[39] M. Zhang, H. Hu, Z. He, W. Wang, Top-k similarity search in heterogeneous information networks with x-star network schema, in: Expert Systems with Applications, 2015, http://dx.doi.org/10.1016/j.eswa.2014.08.039.
[40] P. Zhao, J. Han, Y. Sun, P-rank: a comprehensive structural similarity measure over information networks, in: CIKM, 2009, pp. 553–562.
[41] X. Zhao, C. Xiao, X. Lin, Q. Liu, W. Zhang, A partition-based approach to structure similarity search, PVLDB 7 (3) (2013) 169–180.
Hao Hu is currently a Ph.D. candidate at Fudan University, Shanghai, China. He received his B.S. degree in
Information Security from Fudan University in 2010.
His current research interests include keyword search
and knowledge.
Liujie Sun is a professor at University of Shanghai for
Science and Technology, Shanghai, China. He received
his B.S. from Huazhong University of Science and
Technology, Wuhan, China (1986), his M.S. from University
of Shanghai for Science and Technology, Shanghai, China (1989), and his Ph.D. from University of Shanghai for Science
and Technology, Shanghai, China (2008). His main
research interests are media data management
and multimedia technology.