10.1016/j.neucom.2015.04.084: A Comprehensive Structural-Based Similarity Measure in Directed Graphs

Neurocomputing 167 (2015) 147–157
A comprehensive structural-based similarity measure in directed graphs
Mingxi Zhang a,b,*, Hao Hu b, Zhenying He b, Liping Gao a,b, Liujie Sun a
a University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai 200093, China
b Fudan University, 825 Zhangheng Road, Shanghai 201203, China
Article info
Article history: Received 17 October 2014; received in revised form 14 March 2015; accepted 29 April 2015; communicated by Liang Lin; available online 15 May 2015.

Abstract
Computing similarity between two nodes in directed graphs plays an increasingly important role in various research fields, including clustering, collaborative filtering and community mining. Many similarity measures have been proposed in recent years, such as SimRank, PSimRank and SimFusion. However, these measures consider only the expected meeting probability of equal path length, which may omit some latent similar nodes. Besides, the link importance of each edge is not distinguished, which may lead to unreasonable rankings while searching for similar nodes. In this paper, we propose an effective structural-based similarity measure, ESimRank, for computing similarities in directed graphs. We first define effective relationship strength (ERS) to distinguish link importance by utilizing node activity, node attraction and link frequency. We then formalize the ESimRank equation by combining ERS and the expected meeting probabilities of any path length. Compared to existing similarity measures, ESimRank can find more latent similar nodes and give rankings of better quality. To support fast similarity computation, we develop an extended partial sums-based algorithm, which reduces the time complexity significantly. Extensive experiments demonstrate the effectiveness and efficiency of ESimRank in comparison with state-of-the-art similarity measures.
© 2015 Elsevier B.V. All rights reserved.
Keywords:
Similarity
Effective relationship strength
ESimRank
1. Introduction
Many real networks can be represented as directed graphs, for
example, an e-mail network where a node represents an e-mail user and
an edge implies the delivery relationship that an e-mail is sent
from one user to another; a citation network where a node represents a
paper and an edge implies the citation relationship between papers;
and a web network where a node represents a web page and an edge
implies a hyper-link from one web page to another. With these
networks becoming massive and diverse, discovering valuable knowledge from directed graphs has become a
significant task. Similarity computation between nodes in a directed
graph [2,9,12,14,31,34,38] is one important aspect of network analysis,
which is required by many real applications to evaluate the underlying
similarity between nodes, including clustering [1,33,35], community
mining [18,21] and recommendation systems [5,11,27]. To satisfy
the above requirement, an effective similarity evaluation function is
required for answering questions such as "How similar are these two nodes?" or "Which other nodes are most similar to this one?". Computing similarity in a comprehensive fashion is non-trivial, and the task becomes more challenging when the real graph data to be examined is large and complex.

* Corresponding author at: College of Communication and Art Design, University of Shanghai for Science and Technology, 516 Jungong Road, Shanghai 200093, China.
E-mail addresses: mingxizhang10@fudan.edu.cn (M. Zhang), huhao@fudan.edu.cn (H. Hu), zhenying@fudan.edu.cn (Z. He), lipinggao@fudan.edu.cn (L. Gao), liujiesunx@163.com (L. Sun).
http://dx.doi.org/10.1016/j.neucom.2015.04.084

Many similarity measures have been proposed in previous work,
such as SimRank [12], PSimRank [9] and SimFusion [34]. However,
these measures consider only the expected meeting probability of
equal path length while computing similarity, which would omit
some latent similar nodes and give incomplete results. Besides,
these measures do not distinguish the link importance of each edge,
which would lead to unreasonable rankings while searching for similar
nodes. Although some optimization techniques for similarity computation have been proposed [19,22,25,37,39–41], they are particularly
inefficient in practice since most of them do not pay enough
attention to both link importance and meetings of different path
lengths. When they are applied to real graphs, these two drawbacks
lead to unpromising results.
Example 1. SimRank is a typical similarity measure of this kind, which defines the similarity as the expected meeting probability of equal path length, and the link importance of each edge is simply considered to be equal. Next we take SimRank as an example to analyze these two drawbacks on two fractions of a citation network, shown in Fig. 1, where each node represents a paper and an edge represents a citation between papers. As shown in Fig. 1(a), according to SimRank, the similar node list for a given query p2 is ranked as {(p6, 0.64)}, which contains only p6. This is because SimRank considers only the meetings of equal path length, and only p6 can meet with p2 when they walk along the equal-length paths p6 → p7 → p4 and p2 → p3 → p4. Obviously, this is not reasonable since some similar papers may be omitted. We assume p2 is a paper titled "Scaling link-based similarity search", p3 is a paper titled "SimRank: A measure of structural-context similarity", and p4 is a paper titled "Co-citation in the scientific literature: A new measure of the relationship between two documents". In practice, there exist citation relationships p2 → p3 and p3 → p4, which is consistent with the fraction in this figure. Obviously, p3 is similar to p2, since both of them are on the topic "similarity computation". However, p3 is not in the returned list, since the length of path p3 → p4 is not equal to that of path p2 → p3 → p4, which is not consistent with our analysis.
Fig. 1(b) shows the returned result list of SimRank for a given query
p14, which is ranked as {(p11, 0.4), (p12, 0.4), (p13, 0.4), (p15, 0.2)}. In
this figure, p8 is a paper cited by many papers, such as a survey; hence
it is very common for it to be cited by papers on different
sub-topics. The link importance from p14 to p8 should therefore be
weaker than the link importance from p14 to p9. Other edges can be
analyzed similarly. Hence the similarity between p14 and p13 should
be lower than the similarity between p14 and p15. From this, we can see
that the results of SimRank differ from our intuition.
Motivated by the above discussions, in this paper we propose a structural-based similarity measure, ESimRank, to comprehensively compute the similarity between nodes in a directed graph. The link importance of each edge is distinguished by using effective relationship strength (ERS), which is defined by utilizing three main factors: node activity, node attraction and link frequency. We then formalize the ESimRank equation by integrating ERS and the expected meeting probabilities of different path lengths into similarity computation. Compared to existing methods, ESimRank can find more latent similar nodes and give reasonable rankings while searching for similar nodes. To support fast similarity computation, we develop an extended partial sums-based computation algorithm, which reduces the time cost of similarity computation significantly. Extensive experiments on real datasets demonstrate the effectiveness and efficiency of our approach in comparison with state-of-the-art similarity measures.
The rest of this paper is organized as follows. Section 2 discusses the related work. Section 3 proposes the ERS computation formula and gives the ESimRank equation and the ESimRank computation algorithm. Section 4 rewrites the ESimRank equation and develops an optimized ESimRank computation algorithm. Experimental studies are reported in Section 5. Section 6 concludes this paper and discusses future work.
2. Related work
Due to the practical significance of similarity computation in directed graphs, many approaches have been proposed in recent years. With respect to the focus of this paper, below we briefly describe the work on similarity computation that is most relevant to the current work.

[Fig. 1(a): fragment of a citation network with nodes p1–p7. SimRank result list for query p2: rank 1 is (p6, 0.640); ranks 2–6 are N/A.]
Some early approaches are co-citation [31] and bibliographic
coupling [14]. Co-citation measures the similarity between two
papers based on the common papers that cite both of them:
formally, the similarity between paper a and paper b is defined as
the number of papers that cite both a and b. Bibliographic
coupling defines similarity as the number of papers cited by both a
and b. However, these similarity measures utilize only one-step
neighbors, so some latent similar nodes would be neglected.
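The two one-step measures can be illustrated with a minimal Python sketch. The function names and the edge-list input format (pairs of citing and cited paper) are assumptions made for illustration and do not come from the cited works.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Build in- and out-neighbor sets from (citing, cited) pairs."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for u, v in edges:
        out_nbrs[u].add(v)  # u cites v
        in_nbrs[v].add(u)   # v is cited by u
    return in_nbrs, out_nbrs


def co_citation(in_nbrs, a, b):
    """Number of papers that cite both a and b."""
    return len(in_nbrs.get(a, set()) & in_nbrs.get(b, set()))


def bibliographic_coupling(out_nbrs, a, b):
    """Number of papers cited by both a and b."""
    return len(out_nbrs.get(a, set()) & out_nbrs.get(b, set()))
```

Both functions look only at direct neighbors, which is why two papers with no common citing or cited paper always receive a score of zero under these measures.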
In recent years, some similarity measures based on multi-step neighbors have been proposed. SimRank [12] is a renowned structural-based similarity measure and has been widely used in various fields [9,28,35]. SimRank defines the similarity between two nodes as the expected distance for two random surfers when they walk along the network backwards, based on the intuition that objects are similar if they are referenced by similar objects. PSimRank [9] utilizes fingerprint trees, coupled random walk generation and parallelization probabilities for similarity computation, and the node importance is considered by weakening unimportant nodes. P-PageRank [13] uses random walk distance to measure the relativity or nearness between nodes, and can be applied to both directed and undirected networks. PageSim [22] is based on PageRank score propagation through link paths, and measures the similarity by considering the number of paths between two nodes. Lin et al. [24] proposed a neighbor-based similarity measure, called MatchSim, which utilizes the neighborhood structure for similarity computation; specifically, the similarity between web pages is defined by the average similarity of the maximum matching between their neighbors. BlockSimRank [20] partitions the graph into several blocks based on the block structure of graph data, by which the similarity of each node pair in the graph can be obtained efficiently. TopSim [16] uses SimRank to find the top-k similar nodes in a given graph; it transforms the top-k SimRank problem on a graph G into the problem of finding the top-k nodes with the highest authority on the product graph G × G.
There are some similarity measures which address the heterogeneity of real networks. SimFusion [34] uses a unified relationship matrix (URM) to represent the heterogeneous web objects and the interrelationships among these web objects. The similarity matrix is computed iteratively over the URM, which helps overcome the data sparseness problem and detect the latent relationships among heterogeneous data objects. In fact, SimFusion can be easily transformed into SimRank with minor modification [4]. NetSim [39] measures similarity between nodes by utilizing attribute similarities that are computed according to global structure information. Different from SimFusion, NetSim defines the link importance of different types automatically, while SimFusion defines the relation importance manually. For both SimFusion and NetSim, although link weights are considered for computing similarity, they target heterogeneous networks and address the heterogeneity of graphs by effectively combining relationships from multiple heterogeneous data sources. PathSim [32] is a meta path-based similarity measure that captures the subtle similarity semantics among peer objects in networks. However, PathSim requires the user to provide a meta path, which is difficult for users, especially when they do not clearly know the schema of the heterogeneous network.

[Fig. 1(b): fragment of a citation network with nodes p8–p15. SimRank result list for query p14: rank 1 (p11, 0.400), rank 2 (p12, 0.400), rank 3 (p13, 0.400), rank 4 (p15, 0.200); ranks 5–6 are N/A.]
Fig. 1. Example of directed graphs: two fractions of a citation network. (a) Returned result for the given query node p2, (b) returned result for the given query node p14.
Although the above similarity measures utilize indirect connections for similarity computation, most of them do not consider the meetings of different path lengths. Due to the sparseness of real graph data, some latent similar nodes would be neglected. Besides, these measures fail to distinguish link importance between nodes, which may yield ranking lists of lower quality while searching for similar nodes. As a result, the effectiveness of the returned result is decreased, which becomes an obstacle to their application to real graph data. Compared to these methods, ESimRank integrates the meetings of any path length and link importance into similarity computation. The link importance is distinguished by using ERS, which is defined by utilizing node activity, node attraction and link frequency, and the similarity is defined as the expected meeting probability, no matter whether the path lengths are equal.
For optimizing similarity computation, some techniques have been developed recently. Lizorkin et al. [25] optimized SimRank by essential node pairs, partial sums and threshold-sieved similarities; the computation complexity is reduced from $O(n^2 d^2)$ in the worst case to $O(n^2 d)$, which is further reduced to $\min\{O(n^2 d), O(n^3/\log_2 n)\}$ by using cross summation [26], where $n$ is the number of nodes of a given graph and $d$ is the average degree. Lin et al. [19] introduced a non-iterative SimRank computation method for dynamic networks, which rewrites SimRank into a non-iterative form based on the Kronecker product and vectorization operators. Zhao et al. [41] proposed a partition-based approach to tackle the efficiency problem of SimRank by dividing the data graphs into variable-size non-overlapping partitions. However, these optimization techniques mainly focus on improving the computation efficiency and do not pay much attention to the above two drawbacks.
Yu et al. [37] proposed SimRank*, which resolves the counterintuitive zero-similarity issue and inherits the merits of the basic SimRank philosophy. SimRank* considers the meetings of different path lengths for computing similarities and uses an induced graph for speeding up the computation procedure. Different from this method, our approach distinguishes the link importance by ERS, which is modeled based on link frequency, node activity and node attraction. These factors are effective for improving the quality of the returned result.
Some techniques exploit both in- and out-link relationships for similarity computation. Amsler [2] is an early similarity measure of this kind, which integrates both co-citation [31] and bibliographic coupling [14] into similarity computation. However, Amsler considers only one-step neighbors, which would neglect some latent similar nodes. Cai and Chakravarthy [3] proposed a pairwise similarity calculation method for information networks by extending SimRank, in which both in- and out-link relationships are considered. P-Rank [40] enriches SimRank by jointly encoding both in- and out-link relationships into structural similarity computation. The intuition behind P-Rank is that two objects are similar if (1) they are referenced by similar objects and (2) they reference similar objects. P-Rank utilizes the similarities conveyed from both in- and out-link directions, which resolves the limited information problem mentioned by Jeh and Widom [12]. Different from P-Rank, ESimRank distinguishes the link importance by using ERS, while P-Rank considers the link importance of each edge to be equal. Besides, ESimRank considers the meetings of any path length, while P-Rank considers only the meetings of equal path length. For example, P-Rank does not consider the meeting of a and b when they walk along the path a–v0–v1–b and arrive at the common node v0, while this case is considered by ESimRank.
Lin et al. [23] proposed a model called the Extended Neighborhood Structure (ENS), which utilizes both in- and out-link relationships to extend existing similarity measures. Based on ENS, most existing similarity measures can be extended, such as SimRank, SimFusion and our proposed ESimRank. ENS is a general framework for similarity computation, and its effectiveness depends on the similarity measures that are extended. Although ESimRank also considers in- and out-links for similarity computation, it differs from ENS, since these two factors are exploited for defining ERS, which is used for distinguishing link importance. ESimRank does not conflict with ENS in practice, since ENS can be easily integrated into ESimRank for further improving the performance of similarity computation.
3. ESimRank
As mentioned above, the existing similarity measures consider
only the expected meeting probability of equal path length and do
not give enough attention to link importance, which may decrease
the effectiveness of returned result. For overcoming these two
drawbacks, in this section we propose a comprehensive similarity
measure ESimRank by utilizing ERS for distinguishing link importance and considering the meetings of any path length.
3.1. Preliminaries
Before further discussing similarity computation, we first give the definition of a directed graph for the subsequent discussions.

Definition 1 (Directed graph). A directed graph is denoted by $G(V, E, W)$, where $V$ is a set of nodes, $E$ is a set of links and $W$ is a set of weights. Edge $e(u,v) \in E$ represents the relationship from $u$ to $v$, where $u, v \in V$ are nodes of graph $G$, and the weight of edge $e(u,v)$ is denoted by $w(u,v) \in W$.

For a node $v$ in the graph $G$, the sets of its in-neighbors and out-neighbors are defined as $I(v) = \{q \mid q \in V \wedge e(q,v) \in E\}$ and $O(v) = \{p \mid p \in V \wedge e(v,p) \in E\}$ respectively. The number of in-neighbors is $|I(v)| \le n$ and the number of out-neighbors is $|O(v)| \le n$, where $n$ is the number of nodes in graph $G$.

In this paper, the initial weight $w(u,v)$ is defined as the link frequency of $e(u,v)$, i.e., the number of times the connection from $u$ to $v$ associated with edge $e(u,v)$ occurs. For example, in an e-mail network, the weight of edge $e(u,v)$ is the number of deliveries from user $u$ to user $v$; and in a web network, the weight of edge $e(u,v)$ is the number of hyper-links to web page $v$ contained in web page $u$.
3.2. Modeling ERS
Roughly, the link importance could be defined as the link frequency between two nodes; however, this may be unreasonable since the ERS of two edges may be very different even though their link frequencies are the same. For example, in a citation network, the citation relationship between a paper on "graph mining" and a paper on "similarity" would be weaker than the citation relationship between two papers on the same topic "similarity", even though the link frequency of each citation is 1. For distinguishing link importance, in this subsection we define ERS, which is a measure of "how strong is the connection?". In the following, three factors are discussed intuitively for modeling ERS, namely node activity, node attraction and link frequency, based on which we propose the ERS computation formula.
3.2.1. Node activity

The activity of a node is concerned with the set of its out-neighbors. A node of high activity is a node which points to many nodes. For example, a highly active paper in a citation network is a paper which cites many papers; a highly active user in an e-mail network is a user who sends many e-mails. A node of low activity can be described similarly.
Usually, if the activity of a node is high, it is very common for another node to be pointed to by this node; therefore, the ERS of its out-links should be relatively weaker, and vice versa. For example, in an e-mail network, if a sender is a public e-mail user for sending advertisements, the activity of this user is high since it has sent a large number of e-mails to many other users, and so the delivery relationship from this sender to a receiver should be relatively weaker. Likewise, in a citation network, a survey on data mining usually cites many papers from different sub-topics of data mining; therefore the citation relationship from this survey to the cited papers should be relatively weaker as well.
3.2.2. Node attraction
The attraction of a node is concerned with the set of its in-neighbors. A highly attractive node is a node which is pointed to by many other nodes. For example, a highly attractive paper in a citation network is a paper which is cited by many papers; a highly attractive user in an e-mail network is a user who receives many e-mails. A node of low attraction can be described similarly.
If the attraction of a node is high, it is common for other nodes to point to this node; therefore, the ERS of its in-links should be relatively weaker, and vice versa. For example, in an e-mail network, if the receiver is a public e-mail user for collecting suggestions, it is very common for it to receive e-mails from many other users, so the delivery relationship from the senders to this receiver should be relatively weaker.
3.2.3. Link frequency between nodes
When the node activity is fixed, the higher the link frequency, the stronger the ERS, and vice versa. For example, in an e-mail network, a user who has received many e-mails from the current user should be more important, and the ERS of the delivery relationship should be stronger as well. On the other hand, when the node attraction is fixed, the higher the link frequency, the stronger the ERS, and vice versa. For example, a user who has sent many e-mails to the current user should be more important, and the ERS of the delivery relationship should be stronger as well.
3.2.4. Computing ERS

Based on the above intuitive analysis, the ERS that corresponds to edge $e(u,v)$ is denoted by $ERS(u,v)$, which is naturally formalized as

$$ERS(u,v) = \frac{w(u,v)^2}{\sum_{p \in O(u)} w(u,p) \cdot \sum_{q \in I(v)} w(q,v)} \qquad (1)$$

By this formula, the ERS of each edge can be computed, and then the importance of each link can be distinguished. After getting the ERS of each edge, the weight of each edge is replaced with its ERS, i.e., $w(u,v) = ERS(u,v)$ for all $e(u,v) \in E$.

For a given graph with $m$ edges and $n$ nodes, we can easily derive that the time cost for computing ERS is $O(m d^2)$, where $d$ is the average degree (in-degree or out-degree). Many networks generally have sparse degree distributions, which allows us to derive the time complexity in the average case. One type of network is the random graph studied in [7]; the degree of such a graph has a Poisson distribution, i.e., the probability that a node has degree $k$ is $p(k) = \frac{\lambda^k}{k!} e^{-\lambda}$, and then we have $E(k) = \lambda$. In this case, the complexity is $O(\lambda^3 n)$, which is also denoted by $O(n)$ since $\lambda$ is a constant. Many scale-free networks, such as e-mail networks and web networks, follow a power-law degree distribution [8] governed by an exponent $a$. The expected value of the degree is $E(k) = a/(a-1)$. In this case, the complexity is again $O(n)$.

3.3. ESimRank equation

In this subsection, we formalize the ESimRank equation that computes the similarity between two nodes by combining the ERS and the meetings of any path length. The intuition of ESimRank is that two nodes may be similar if they can meet at common nodes, even though they walk along paths of different lengths. Based on this intuition, we use the expected meeting probability for measuring similarity, which measures how soon two random surfers are expected to meet at common nodes when they start at two nodes respectively over a directed graph. For a directed graph $G$ with ERS, the similarity under ESimRank between $a \in V$ and $b \in V$ is initialized as $sim(a,b) = 1$ if $a = b$, otherwise:

$$sim(a,b) = C \sum_{p \in \bar{a}} \sum_{q \in \bar{b}} P(a,p)\, P(b,q)\, sim(p,q) \qquad (2)$$

where $C$ is a decay factor between 0 and 1, $\bar{a} = O(a) \cup \{a\}$ and $\bar{b} = O(b) \cup \{b\}$, and

$$P(a,p) = \frac{ERS(a,p)}{\sum_{x \in \bar{a}} ERS(a,x)}$$

is the transition probability from $a$ to $p$ with ERS taken into account. Here the parameter $C$ is used for ensuring the convergence of ESimRank; tuning $C$ does not change the rankings, although it changes the absolute magnitudes of the scores [12]. By analyzing the ESimRank equation, we can easily conclude that ESimRank utilizes the expected meeting probabilities of any path length and integrates the ERS effectively for computing similarities.

3.4. ESimRank computation algorithm

By adopting the SimRank computation procedure [12], we compute ESimRank scores as follows. We first give the iterative computation equation $R_l(*,*)$ at iteration $l$. The iterative computation is started with $R_0(*,*)$, and the similarity between $a \in V$ and $b \in V$ is initialized as

$$R_0(a,b) = \begin{cases} 0 & \text{if } a \neq b \\ 1 & \text{if } a = b \end{cases} \qquad (3)$$

When iteration $l + 1 > 0$, the similarity between $a \in V$ and $b \in V$ is defined as $R_{l+1}(a,b) = 1$ if $a = b$, otherwise:

$$R_{l+1}(a,b) = C \sum_{p \in \bar{a}} \sum_{q \in \bar{b}} P(a,p)\, P(b,q)\, R_l(p,q) \qquad (4)$$

The computation procedure of ESimRank scores is shown in Algorithm 1. At each iteration, the similarity between node $a \in V$ and node $b \in V$ is computed by Eq. (4). There are $n^2$ node pairs in total in graph $G$, so the time cost for computing the similarity matrix can be derived as $O(l n^2 d^2)$, also denoted by $O(n^2 d^2)$.
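The ERS formula of Section 3.2.4 and the iterative ESimRank equation, Eq. (4), can be sketched in Python as follows. This is a minimal illustration, not the authors' C implementation. The function names and the dictionary-based graph representation are assumptions. In particular, the text above does not state the value of ERS(a, a) needed for the self-transition in ā = O(a) ∪ {a}, so the `self_ers` parameter below is a hypothetical choice introduced only for this sketch.

```python
from collections import defaultdict


def compute_ers(weights):
    """Map {(u, v): w(u, v)} to {(u, v): ERS(u, v)} following Eq. (1)."""
    out_sum, in_sum = defaultdict(float), defaultdict(float)
    for (u, v), w in weights.items():
        out_sum[u] += w  # sum of w(u, p) over p in O(u)
        in_sum[v] += w   # sum of w(q, v) over q in I(v)
    return {(u, v): w * w / (out_sum[u] * in_sum[v])
            for (u, v), w in weights.items()}


def transition_probs(nodes, ers, self_ers=1.0):
    """P(a, p) over a_bar = O(a) U {a}.

    self_ers is an ASSUMED stand-in for ERS(a, a); it is not taken from the paper.
    An explicit self-loop in ers overrides it.
    """
    P = {a: {a: self_ers} for a in nodes}
    for (u, v), s in ers.items():
        P[u][v] = s
    for a in nodes:
        total = sum(P[a].values())
        for p in P[a]:
            P[a][p] /= total  # normalize each row to sum to 1
    return P


def esimrank(nodes, P, C=0.8, iterations=5):
    """Naive iteration of Eq. (4), starting from R_0 of Eq. (3)."""
    R = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            new[a] = {}
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                else:
                    new[a][b] = C * sum(pa * pb * R[p][q]
                                        for p, pa in P[a].items()
                                        for q, pb in P[b].items())
        R = new
    return R
```

On the graph fragment of Fig. 1(a) with unit link frequencies, this sketch gives p3 a non-zero similarity to p2 after one iteration, because p3 may stay in place while p2 steps to p3. The exact scores depend on the assumed `self_ers` value.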
Algorithm 1. ESimRank computation algorithm.

Input:
  Graph $G(V, E, ERS)$, decay factor $C \in (0,1)$ and the iteration number $L$;
Output:
  ESimRank score $R_L(a,b)$, $\forall a, b \in V$;

Initialize $R_0(a,b)$, $\forall a, b \in V$ by Eq. (3);
for $l = 1$ to $L$ do
  for $a \in V$ do
    for $b \in V$ do
      if $a = b$ then
        $R_l(a,b) \leftarrow 1$;
      else
        $R_l(a,b) \leftarrow C \sum_{p \in \bar{a}} \sum_{q \in \bar{b}} P(a,p)\, P(b,q)\, R_{l-1}(p,q)$;
      end if
    end for
  end for
end for

The similarity computation of ESimRank is similar to that of SimRank [12]; therefore, they share the same properties, which are stated as follows.

Theorem 1. For $a, b \in V$, parameter $C \in (0,1)$ and iteration $l$, the iterative ESimRank equations have the following properties:
1. $R_l(a,b) = R_l(b,a)$
2. $0 \le R_l(a,b) \le R_{l+1}(a,b) \le 1$
3. $sim(a,b) - R_l(a,b) \le C^{l+1}$

Theorem 1 demonstrates the symmetry and monotonicity properties of ESimRank, and gives the maximal difference between the theoretical value and the computed value. These properties can be easily derived for ESimRank from the corresponding proofs for SimRank [12,25].

4. Optimized ESimRank

The naive ESimRank computation process suffers from limited efficiency when the graph becomes large. To reduce the time cost of similarity computation, in this section we rewrite the similarity computation process using extended partial sums and reduce the unnecessary operations on low similarities by setting thresholds.

4.1. Rewrite ESimRank equation

We propose extended partial sums by integrating ERS into the optimization technique in [25], which reduces the access operations to $R_l(*,*)$ required for computing $R_{l+1}(*,*)$. We use $R_{l+1}^{\delta_{l+1}}(a,b)$ to denote the similarity between $a$ and $b$ with respect to $\delta_{l+1}$, which is treated as a threshold at iteration $l+1$. The similarity function $R_{l+1}^{\delta_{l+1}}(a,b)$ over a set of thresholds $\delta_0, \delta_1, \ldots, \delta_{l+1}$ is then defined as follows: $R_0^{\delta_0}(a,b) = R_0(a,b)$; $R_1^{\delta_1}(a,b) = R_1(a,b)$; and for $l + 1 \ge 2$, $R_{l+1}^{\delta_{l+1}}(a,b) = 1$ if $a = b$, otherwise $R_{l+1}^{\delta_{l+1}}(a,b)$ is approximately defined as

$$R_{l+1}^{\delta_{l+1}}(a,b) = C \sum_{q \in \bar{b}} P(b,q)\, epartial_{\bar{a}}^{R_l^{\delta_l}}(q) \qquad (5)$$

if the right-hand side of Eq. (5) is bigger than $\delta_{l+1}$, otherwise $R_{l+1}^{\delta_{l+1}}(a,b) = 0$, where

$$epartial_{\bar{a}}^{R_l^{\delta_l}}(q) = \sum_{p \in \bar{a}} P(a,p)\, R_l^{\delta_l}(p,q) \qquad (6)$$

Theorem 2. For $a, b \in V$, parameter $C \in (0,1)$, iteration $l$, and threshold $\delta$, the following estimations hold:
1. $R_l(a,b) - R_l^{\delta_l}(a,b) \le \sum_{m=1}^{l} C^{l-m} \delta_m$
2. $sim(a,b) - R_l^{\delta_l}(a,b) \le C^{l+1} + \sum_{m=1}^{l} C^{l-m} \delta_m$
where $\delta_m = \varepsilon / (C^{l-m}\, l)$ and $\varepsilon$ is the upper bound of accuracy loss.

Theorem 2 can be proved similarly to the proof for optimized SimRank in [25].

4.2. Optimized ESimRank computation algorithm

The optimized ESimRank computation procedure is shown in Algorithm 2. Line 3 computes $\delta_{l-1}$ at iteration $l$ by the accuracy estimation in [25]; the time cost of this step is $O(1)$. In lines 6–11, for $a \in V$, we first compute the extended partial sums for all the nodes by accumulation operations on non-zero similarities and generate the set $T$ that is used for obtaining the set essential; the time cost of this step can be derived as $O(n'(d+1))$, where $n' \le n$ is the number of non-zero entries in $R_{l-1}(a,*)$. Lines 12–16 generate the set essential according to $T$; the time cost is $O(|T|(d+1))$, where $|T| \le n$. The set essential is such that the similarity between $a$ and any $b \notin essential$ is zero. Therefore, in lines 17–24, we consider only $b \in essential$ and compute only the similarity between $a$ and such $b$; the time cost of this step is derived as $O(|essential|(d+1))$, where $|essential| \le n$. Finally, the total time cost of this algorithm can be derived as $O(L n (1 + n'(d+1) + |T|(d+1) + |essential|(d+1)))$, which is denoted by $O(L n^2 d)$ in the worst case.

Algorithm 2. Optimized ESimRank computation algorithm.

Input:
  Graph $G(V, E, ERS)$, decay factor $C \in (0,1)$, upper bound of accuracy loss $\varepsilon$ and iteration number $L$;
Output:
  ESimRank score $R_L^{\delta_L}(a,b)$, $\forall a, b \in V$;

1: Initialize $R_0^{\delta_0}(a,b)$, $\forall a, b \in V$ by Eq. (3);
2: for $l = 1$ to $L$ do
3:   $\delta_{l-1} \leftarrow \varepsilon / (C^{L-l+1} L)$;
4:   for $a \in V$ do
5:     $T \leftarrow \emptyset$, $essential \leftarrow \emptyset$, $epartial_{\bar{a}}^{R_{l-1}^{\delta_{l-1}}}(*) \leftarrow 0$;
6:     for $p \in \bar{a}$ do
7:       for $q \in \{q \mid R_{l-1}^{\delta_{l-1}}(p,q) \neq 0\}$ do
8:         $T \leftarrow T \cup \{q\}$;
9:         $epartial_{\bar{a}}^{R_{l-1}^{\delta_{l-1}}}(q) \leftarrow epartial_{\bar{a}}^{R_{l-1}^{\delta_{l-1}}}(q) + P(a,p)\, R_{l-1}^{\delta_{l-1}}(p,q)$;
10:      end for
11:    end for
12:    for $q \in T$ do
13:      for $b \in I(q) \cup \{q\}$ do
14:        $essential \leftarrow essential \cup \{b\}$;
15:      end for
16:    end for
17:    for $b \in essential$ do
18:      if $a = b$ then
19:        $R_l^{\delta_l}(a,b) \leftarrow 1$, continue;
20:      end if
21:      for $q \in \bar{b}$ do
22:        $R_l^{\delta_l}(a,b) \leftarrow R_l^{\delta_l}(a,b) + C\, P(b,q)\, epartial_{\bar{a}}^{R_{l-1}^{\delta_{l-1}}}(q)$;
23:      end for
24:    end for
25:  end for
26: end for

Compared to the naive ESimRank algorithm, the computation efficiency is improved significantly. Different from the optimization algorithm in [25], this algorithm integrates ERS and the meetings of different path lengths into similarity computation, while SimRank considers only the meetings of equal path length. Besides, the extended partial sums are computed by accumulation operations, by which the access operations to zero similarities are significantly reduced.
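The extended partial sums of Eqs. (5) and (6) can be sketched in Python as follows. This is an illustrative reading of the technique, not the authors' implementation. The function name, the sparse dictionary layout, and the input `P` (a mapping from each node a to its transition probabilities over ā, including the self-transition) are assumptions. The threshold is applied when a new score is written, as in Eq. (5), with δ_m = ε/(C^(L-m) L) taken from Theorem 2, and the first iteration is left unsieved because R_1 is defined without a threshold.

```python
def esimrank_partial_sums(nodes, P, C=0.8, iterations=5, eps=0.0):
    """Sparse ESimRank iteration using extended partial sums and thresholds."""
    # preds[q]: all b whose b_bar contains q, i.e. I(q) U {q} when P has self-transitions
    preds = {q: set() for q in nodes}
    for b in nodes:
        for q in P[b]:
            preds[q].add(b)

    R = {a: {a: 1.0} for a in nodes}  # R_0, storing non-zero entries only
    for l in range(1, iterations + 1):
        # Threshold for iteration l (assumed form from Theorem 2); none for l = 1
        delta = eps / (C ** (iterations - l) * iterations) if l >= 2 else 0.0
        new = {}
        for a in nodes:
            # Eq. (6): accumulate over non-zero similarities only
            epartial = {}
            for p, pa in P[a].items():
                for q, r in R[p].items():
                    epartial[q] = epartial.get(q, 0.0) + pa * r
            # Only nodes that can reach some q with a non-zero partial sum matter
            essential = set()
            for q in epartial:
                essential |= preds[q]
            row = {a: 1.0}
            for b in essential:
                if b == a:
                    continue
                # Eq. (5): reuse the partial sums computed once for a
                s = C * sum(pb * epartial.get(q, 0.0) for q, pb in P[b].items())
                if s > delta:
                    row[b] = s
            new[a] = row
        R = new
    return R
```

With `eps=0.0` no entry is sieved, so the result should coincide with the naive iteration of Eq. (4). A larger `eps` trades accuracy for sparser rows.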
5. Experimental study
All our experiments are performed on an Intel(R) Xeon (R) CPU
2.27 GHz, 8 GB memory, running Windows Server 2008. All algorithms were implemented in C and compiled using VS 2010.
5.1. Setup
5.1.1. Datasets
We ran our experiments on the following three real datasets:
Citation network: The citation network is built from the high energy physics paper dataset¹ [10,17]. We obtain 9831 papers by breadth-first traversal (BFT) from a randomly chosen paper and build the citation relationships among these papers; the initial weight of each edge is set to 1.
Web network: We build the web network from the dataset crawled from the web site of Simon Fraser University.² We obtain 6103 pages by BFT from a randomly chosen page and build the hyper-links among these pages according to the whole web network. The initial weight of each edge is the link frequency of the corresponding hyper-links contained in the current page.
E-mail network: The e-mail network is extracted from the Enron dataset.³ By BFT we obtain 9281 e-mail users and the delivery relationships among these users; the initial weight is the delivery frequency between users.

The out-degree distributions of these datasets are shown in Fig. 2. We observe that the node degrees of these networks generally follow a power-law distribution. By the estimation method provided in [6], the values of the power-law exponent are estimated as 1.5233, 2.0007 and 2.6880 on the citation network, web network and e-mail network respectively.
5.1.2. Comparison methods and evaluation
We compare our proposed method with SimRank [12], optimized SimRank (Opt-SimRank) [26], PSimRank [9], SimFusion [34], SimRank* [37], ENS [23] and P-Rank [40], which are well-defined similarity measures. All the comparison methods are implemented strictly following their papers. The maximal accuracy loss of both optimized ESimRank (Opt-ESimRank) and Opt-SimRank is set to 0.0005. The decay factor C is set to 0.8, since different values of C do not change the rankings, although they change the absolute similarity scores, as analyzed in [12].
We use MAP (Mean Average Precision) to evaluate the effectiveness of the similarity measures, which is defined as

$$\mathrm{MAP}_{A,k} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathrm{AP}_{A,k}(v_i)$$

where $Q$ is the set of queries chosen from the graph and $A$ is the similarity measure. $\mathrm{AP}_{A,k}(v_i)$ is the average precision of the returned top $k$ similar nodes for a given node $v_i$, which is defined as

$$\mathrm{AP}_{A,k}(v_i) = \frac{1}{k} \sum_{j=1}^{k} P_{A,k}(v_i, v_j)$$

where $j$ is the rank of node $v_j$ in the returned result list, and $P_{A,k}(v_i, v_j)$ is the precision function for evaluating the effectiveness of similarity between $v_i$ and $v_j$.

¹ http://snap.stanford.edu/data/cit-HepPh.html
² http://www.sfu.ca/
³ http://www.cs.cmu.edu/~enron/
For each dataset, we randomly pick 1000 nodes as queries to observe the MAP change on different datasets. In the citation network, the precision evaluation function is defined as the fraction of the papers citing $v_j$ that also cite $v_i$. This evaluation method has been used in [12], based on the intuition that similar papers should be cited by similar papers. In the web network, the precision evaluation function is defined as the fraction of the web pages linking to $v_j$ that also link to $v_i$. In the e-mail network, it is defined as the fraction of the users sending e-mail to $v_j$ that also send e-mail to $v_i$.
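The evaluation procedure described above can be sketched as follows. This is an illustrative reading of the definitions rather than the authors' code: the function names, the `in_nbrs` dictionary of in-neighbour sets, and the convention of returning 0 when a returned node has no in-neighbours are assumptions.

```python
def cocitation_precision(in_nbrs, vi, vj):
    """Fraction of the nodes linking to vj that also link to vi."""
    to_vj = in_nbrs.get(vj, set())
    if not to_vj:
        return 0.0  # assumed convention for nodes with no in-links
    return len(to_vj & in_nbrs.get(vi, set())) / len(to_vj)

def average_precision(in_nbrs, vi, ranked, k):
    """Mean precision over the top-k nodes returned for query vi."""
    top = ranked[:k]
    return sum(cocitation_precision(in_nbrs, vi, vj) for vj in top) / k

def mean_average_precision(in_nbrs, results, k):
    """results maps each query node to its ranked list of similar nodes."""
    total = sum(average_precision(in_nbrs, q, r, k) for q, r in results.items())
    return total / len(results)
```

The same three functions cover all three datasets, because each precision function is the same in-neighbour overlap with a different meaning attached to an edge (citation, hyperlink or e-mail delivery).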
The efficiency comparison covers the running time for precomputing the similarity matrix and the time cost for computing ERS. In addition, we measure the similarity matrix size to observe the space cost of our approach.
5.2. Effectiveness
5.2.1. MAP change on different parameter settings
Fig. 3(a) shows the MAP score change on varying iteration l in the citation network, where rank k = 10. We observe that the MAP scores of ESimRank and Opt-ESimRank are higher than those of the other methods from iterations 1 to 6, which demonstrates that our approach is more effective than the comparison methods. Although the MAP scores of SimRank* and our approach are close, we still observe a minor effectiveness improvement, which may be because our approach considers link importance while SimRank* does not. We also observe that the MAP score of each method except SimFusion increases with increasing l, even though the increasing trends of some methods are not evident. This is because all the methods converge except SimFusion, which cannot ensure a converged state [34]. The MAP curves of ESimRank and Opt-ESimRank almost overlap, which demonstrates the reasonability of our proposed optimization of ESimRank. Fig. 3(b) shows the MAP score change on varying rank k in the citation network, where iteration l = 6. With increasing k, the curves of all four methods trend downward: higher-ranking nodes are more similar to the given query node, while lower-ranking nodes fall toward the rear of the list of similar nodes, which is the expected result. Generally, the MAP scores of ESimRank and Opt-ESimRank are higher than those of the comparison methods, although the improvement over SimRank* is not evident, which is consistent with the result shown in Fig. 3(a).
Fig. 4(a) shows the MAP score change on varying iteration l in the web network, where rank k = 10. The MAP scores of ESimRank and Opt-ESimRank are higher than those of the other methods from iterations 1 to 6, which shows the better effectiveness of our approach. The upward trend of ESimRank and Opt-ESimRank is more evident than in the citation network, and the two curves almost overlap except at l = 5, which again demonstrates the reasonability of our optimized method. Different from the result on the citation network, the curves of SimRank and PSimRank trend downward. Besides, we also find that our approach shows a more evident improvement over SimRank*. This is because different methods may suit different datasets, which is common for similarity computation algorithms. Fig. 4(b) shows the MAP score on varying rank k in the web network, where iteration l = 6. At different k, the MAP scores of ESimRank and Opt-ESimRank are higher than those of the comparison methods as well. For all the methods, we observe that the MAP scores decrease with increasing k, which is similar to the result in the citation network and can be explained similarly.
Fig. 2. Distribution of node degree on each dataset. (a) Citation network, (b) web network, (c) e-mail network.
Fig. 3. MAP score change in citation network. (a) MAP score on varying l, (b) MAP score on varying k.
Fig. 4. MAP score change in web network. (a) MAP score on varying l, (b) MAP score on varying k.
The MAP score change on varying iteration l in the e-mail network is shown in Fig. 5(a), where rank k = 10. At iteration 1, the MAP scores of ESimRank and Opt-ESimRank are lower than those of the other methods except SimFusion. For l > 1, the MAP score of our approach becomes higher than those of the other methods, and subsequent changes become increasingly minor. Although the curves are not so smooth, the MAP scores are still higher than those of the other methods. Fig. 5(b) shows the MAP score on varying rank k in the e-mail network, where iteration l = 6. Similar to the above datasets, the curves trend downward with increasing k, which can be explained similarly. At different k, the MAP scores of ESimRank and Opt-ESimRank are higher than those of the comparison methods as well.
From the MAP score comparison on different datasets, we can conclude that (1) our approach improves the effectiveness significantly when computing similarities; (2) the ranking quality of our approach is good, as demonstrated by setting different rank k; and (3) the optimization technique on ESimRank works well for speeding up the similarity computation on real graph data.
5.2.2. Effectiveness enhancement of different factors

To test the effectiveness enhancement of different factors, we next compute ESimRank similarity scores considering link frequency, node attraction, node activity and ERS, respectively. The MAP scores of these methods on each dataset are shown in Table 1, where iteration l = 6 and rank k = 10. The second column corresponds to the MAP scores of ESimRank without any factors, where the weight of each edge is set to 1. The third column shows the MAP scores considering link frequency. Compared to the result of ESimRank without any factors, the improvement of the MAP score is 0.00% on both the citation and web networks, and 0.79% on the e-mail network. This is because the link frequency of each edge in the citation network is 1, and in the web network most of the link frequencies are 1 as well, so link frequency has no effect on the similarity computation. On the e-mail network, the improvement of 0.79% is very minor in practice. The fourth column corresponds to the MAP scores considering node attraction, where the improvement is 2.68% on the citation network, 28.22% on the web network and 39.04% on the e-mail network. The fifth column corresponds to the MAP scores considering node activity. For this factor, the improvement is 0.07% on the citation network, 0.75% on the web network and 35.25% on the e-mail network. The last column corresponds to the MAP scores considering ERS, where the improvement is 4.10% on the citation network, 38.49% on the web network and 37.83% on the e-mail network.
From this result, we can conclude that ERS can really enhance the effectiveness score when computing similarities. Among the three factors, link frequency has little effect on the similarity computation, and there is almost no improvement from utilizing this factor. Node attraction and node activity give evident improvement on the MAP score, and are the two essential factors for the final performance.
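The percentages quoted in this subsection are relative gains over the no-factor baseline, and they can be recomputed from the Table 1 scores. The short sketch below does this; the MAP values are copied from Table 1, while the names `TABLE1`, `FACTORS` and `improvements` are introduced here purely for illustration.

```python
# MAP scores from Table 1, in column order:
# without any factor, link frequency, node attraction, node activity, ERS.
TABLE1 = {
    "citation": [0.0452147, 0.0452147, 0.0464276, 0.0452446, 0.0470685],
    "web": [0.0103911, 0.0103911, 0.0133236, 0.0104690, 0.0143902],
    "e-mail": [0.0107991, 0.0108843, 0.0150150, 0.0146060, 0.0148843],
}
FACTORS = ["link frequency", "node attraction", "node activity", "ERS"]

def improvements(row):
    """Relative MAP gain (in percent) of each factor over the baseline."""
    base = row[0]
    return [100.0 * (score - base) / base for score in row[1:]]

if __name__ == "__main__":
    for dataset, row in TABLE1.items():
        gains = ", ".join(
            f"{name}: {gain:.2f}%" for name, gain in zip(FACTORS, improvements(row))
        )
        print(f"{dataset}: {gains}")
```

Reading the output row by row makes the pattern in the text easy to see: link frequency contributes almost nothing, while node attraction and ERS account for most of the gain on the web and e-mail networks.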
5.3. Efficiency
5.3.1. Running time of similarity computation
Table 2 shows the running time of similarity computation. On each dataset, PSimRank is the most time-consuming, since it involves set operations while computing similarity. ESimRank needs more time than the other methods except PSimRank, because it is more comprehensive and naturally requires more operations for computing the meeting probabilities of any path length. Fortunately, Opt-ESimRank is significantly more efficient than the naive ESimRank: the running time of Opt-ESimRank is only 4.06% of that of ESimRank on the citation network, 52.70% on the web network and 9.68% on the e-mail network. The improvement on the web network is less evident because the node degree of the web network is lower than that of the other networks and, as analyzed, the optimization technique pays off more at higher node degrees. Although Opt-ESimRank requires more time than Opt-SimRank, its time cost is still significantly lower than that of the other methods. From this result, we conclude that the time cost of our proposed approach is comparable to existing similarity measures.
Table 1
MAP score comparison of different factors.
Dataset           Without any factor  Link frequency  Node attraction  Node activity  ERS
Citation network  0.0452147           0.0452147       0.0464276        0.0452446      0.0470685
Web network       0.0103911           0.0103911       0.0133236        0.0104690      0.0143902
E-mail network    0.0107991           0.0108843       0.0150150        0.0146060      0.0148843
Fig. 5. MAP score change in e-mail network. (a) MAP score on varying l, (b) MAP score on varying k.
Table 2
Running time of similarity computation (min).
Dataset           ESimRank  Opt-ESimRank  SimRank  Opt-SimRank  PSimRank  SimFusion  SimRank*
Citation network  68.76     2.79          32.11    1.90         238.53    33.02      29.09
Web network       21.54     11.35         12.98    3.58         153.71    13.51      14.90
E-mail network    36.37     3.52          16.04    0.31         205.75    23.63      13.23
Table 3
Similarity matrix size (K).
Dataset           ESimRank   Opt-ESimRank  SimRank    Opt-SimRank  PSimRank   SimFusion  SimRank*
Citation network  9525.65    2368.59       5848.58    3377.65      5848.58    4380.14    9502.77
Web network       20,827.63  2112.66       19,375.96  19,248.70    19,375.96  19,372.48  20,802.60
E-mail network    6775.18    463.92        152.71     150.82       152.71     144.00     6775.18
5.3.2. Similarity matrix size

Table 3 shows the sizes of the non-zero similarity matrices of the different methods. On each dataset, the similarity matrix of ESimRank is larger than those of the other methods. This is because ESimRank is more comprehensive: more latent similar nodes are discovered by considering the meetings of any path length, hence more space is required for storing such similarities. The similarity matrix sizes of ESimRank and SimRank* are close, since both methods focus on finding more similar nodes. We also observe that Opt-ESimRank decreases the similarity matrix size significantly, since entries with low similarities are omitted by setting a threshold. The similarity matrix size of Opt-ESimRank is 24.81% of that of ESimRank on the citation network, 10.14% on the web network and 6.85% on the e-mail network, which demonstrates that our proposed optimized method can reduce the space cost significantly. The similarity matrix size of Opt-SimRank is lower than those of the other methods, since entries with low similarities are omitted. The similarity matrix sizes of SimRank, PSimRank and SimFusion are close since they have the same space complexity.
From this result we conclude that, although ESimRank requires much space for storing the similarity matrix, the optimized method can reduce the similarity matrix size significantly, so it can be efficiently applied to real graph data.
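The space saving comes from not storing low-similarity entries. The sketch below illustrates that idea on a sparse similarity map; it is a generic illustration rather than the Opt-ESimRank procedure itself, and the names `prune` and `size_ratio`, the dictionary representation and the example threshold are all hypothetical.

```python
def prune(sim, threshold):
    """Drop stored node pairs whose similarity does not exceed threshold."""
    return {pair: score for pair, score in sim.items() if score > threshold}

def size_ratio(full, pruned):
    """Size of the pruned matrix as a percentage of the full matrix."""
    return 100.0 * len(pruned) / len(full)

if __name__ == "__main__":
    # Toy non-zero similarity entries keyed by node pair (made-up values).
    sim = {("a", "a"): 1.0, ("a", "b"): 0.3, ("a", "c"): 0.0004, ("b", "c"): 0.02}
    kept = prune(sim, threshold=0.001)  # hypothetical threshold
    print(len(kept), size_ratio(sim, kept))
```

In the paper the threshold is applied inside each iteration rather than once at the end, so the actual reduction also limits how many entries later iterations have to touch.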
5.3.3. Time cost for computing ERS
We also recorded the time cost for computing ERS on each dataset: 23 ms on the citation network, 15 ms on the web network and 17 ms on the e-mail network. This result shows that computing ERS is feasible on large-scale networks.
5.4. Effectiveness comparison by exploiting in- and out-relationships
In this subsection, we compare effectiveness when exploiting both in- and out-relationships. ENS is a general framework for improving computation effectiveness, by which many existing measures can be extended by integrating in- and out-relationships into similarity computation. We first adopt the spirit of ENS [23] to extend ESimRank and SimRank. The ENS-based ESimRank (ENS-ESimRank) is compared with P-Rank and ENS-based SimRank (ENS-SimRank), where ENS-SimRank is in fact a special case of ENS. P-Rank is not extended by ENS since the in- and out-relationships have already been integrated into P-Rank for similarity computation [40].
The MAP scores of the returned results are shown in Table 4, where k = 10 and l = 6. We find that the MAP score of ENS-ESimRank is evidently higher than those of P-Rank and ENS-SimRank, which is consistent with our previous discussion that the effectiveness of ENS depends on the performance of the similarity measure being extended. The results demonstrate that the spirit of ENS can indeed be applied to ESimRank to further improve the effectiveness, and the improvement is evident, giving better effectiveness than both ENS-SimRank and P-Rank.
Table 4
MAP score comparison by exploiting in- and out-relationships.
Dataset           ENS-ESimRank  ENS-SimRank  P-Rank
Citation network  0.0860502     0.0690213    0.0634202
Web network       0.0206286     0.0131910    0.0058847
E-mail network    0.0152735     0.0141921    0.0082692
6. Conclusion and future work

The research presented in this paper tackled the similarity computation problem in directed graphs. A comprehensive structural-based similarity measure, ESimRank, is proposed for effectively computing similarity between nodes. Compared to existing similarity measures, our approach computes similarity by utilizing ERS and the expected meeting probabilities of any path length, which can find more latent similar nodes and give better ranking quality when searching similar nodes. To support fast similarity computation, we developed an extended partial sums-based optimized ESimRank algorithm, which reduces the unnecessary operations while computing similarity scores.
The strengths of our research can be summarized as follows. Firstly, ESimRank distinguishes link importance by ERS, which gives better ranking quality when searching similar nodes. Secondly, ESimRank considers the meetings of any path length, so more latent similar nodes can be found. Thirdly, the optimization technique on ESimRank is efficient for fast similarity computation.
Our research has theoretical and practical implications. In theory, ESimRank is a solution that improves effectiveness by utilizing ERS and the meeting probabilities of any path length, and can effectively and efficiently compute similarity in directed graphs. As far as practical implications are concerned, ESimRank can be applied to many real applications, including query expansion, clustering and web search engines, to evaluate the underlying similarities under different real settings of graph data and satisfy the requirements of these applications.
Our research also has some limitations, which are summarized below together with the work to be done in the future. Firstly, ESimRank is defined on homogeneous graphs and is not suitable for heterogeneous graphs. Accordingly, we plan to extend ESimRank to heterogeneous graphs by building a general framework for unifying different heterogeneous relationships. Secondly, our approach is defined on static graphs, and the case of dynamic graphs is not addressed. To adapt to dynamic graphs, we will study the problem of incremental ESimRank computation with reference to existing methods on incremental similarity computation [19,36]. Finally, although this paper considers ERS for computing similarity, the semantics of relationships are not considered. In the future, we will develop a new semantic similarity measure by integrating semantic similarity [15,29,30] into ESimRank.
Acknowledgments
This work was supported by National Science Foundation of
China Grants 61170007, 61170006 and 61202376; Bidding Program
of Shanghai Research Institute of Publishing and Media Grants
SAYB1410; Young University Teacher Funding Program of Shanghai
Municipal Education Commission Grants ZZSLG14021; Innovation
Program of Shanghai Municipal Education Commission Grants
15ZZ074, 13YZ075 and 13ZZ111; and National High-tech R&D
Program of China (863 Program) Grants SS2015AA011809.
References
[1] H. Abusaimeh, M. Shkoukani, F. Alshrouf, Balancing the network clusters for the lifetime enhancement in dense wireless sensor networks, Arab. J. Sci. Eng. 39 (5) (2014) 3771–3779.
[2] R. Amsler, Application of citation-based automatic classification, Technical Report, The University of Texas at Austin Linguistics Research Center, December 1972.
[3] Y. Cai, S. Chakravarthy, Pairwise similarity calculation of information networks, in: Proceedings of Data Warehousing and Knowledge Discovery, 13th International Conference, DaWaK 2011, Toulouse, France, August 29–September 2, 2011, pp. 316–329.
[4] Y. Cai, M. Zhang, C. Ding, S. Chakravarthy, Closed form solutions of similarity algorithms, in: SIGIR, 2010, pp. 709–710.
[5] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'El, I. Ronen, E. Uziel, S. Yogev, S. Chernov, Personalized social search based on the user's social network, in: CIKM, 2009, pp. 1227–1236.
[6] A. Clauset, C.R. Shalizi, M.E.J. Newman, Power-law distributions in empirical data, SIAM Rev. 51 (4) (2009) 661–703.
[7] P. Erdős, A. Rényi, On random graphs, Publ. Math. Debr. 6 (1959) 290–297.
[8] M. Faloutsos, P. Faloutsos, C. Faloutsos, On power-law relationships of the internet topology, in: SIGCOMM, 1999, pp. 251–262.
[9] D. Fogaras, B. Rácz, Scaling link-based similarity search, in: WWW, 2005, pp. 641–650.
[10] J. Gehrke, P. Ginsparg, J.M. Kleinberg, Overview of the 2003 kdd cup, SIGKDD Explor. 5 (2) (2003) 149–151.
[11] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel, S. Yogev, S. Ofek-Koif, Personalized recommendation of social software items based on social relations, in: RecSys, 2009.
[12] G. Jeh, J. Widom, Simrank: a measure of structural-context similarity, in: KDD, 2002, pp. 538–543.
[13] G. Jeh, J. Widom, Scaling personalized web search, in: WWW, 2003, pp. 271–279.
[14] M.M. Kessler, Bibliographic coupling between scientific papers, Am. Doc. 14 (1963) 10–25.
[15] K. Khan, B.B. Baharudin, A. Khan, Semantic-based unsupervised hybrid technique for opinion targets extraction from unstructured reviews, Arab. J. Sci. Eng. 39 (5) (2014) 3681–3689.
[16] P. Lee, L.V.S. Lakshmanan, J.X. Yu, On top-k structural similarity search, in: ICDE, 2012, pp. 774–785.
[17] J. Leskovec, J.M. Kleinberg, C. Faloutsos, Graphs over time: densification laws, shrinking diameters and possible explanations, in: KDD, 2005, pp. 177–187.
[18] J. Leskovec, K.J. Lang, A. Dasgupta, M.W. Mahoney, Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, 2008, CoRR abs/0810.1355.
[19] C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, T. Wu, Fast computation of simrank for static and dynamic information networks, in: EDBT, 2010, pp. 465–476.
[20] P. Li, Y. Cai, H. Liu, J. He, X. Du, Exploiting the block structure of link graph for efficient similarity computation, in: PAKDD, 2009, pp. 389–400.
[21] W. Lin, X. Kong, P.S. Yu, Q. Wu, Y. Jia, C. Li, Community detection in incomplete information networks, in: WWW, 2012, pp. 341–350.
[22] Z. Lin, I. King, M.R. Lyu, Pagesim: a novel link-based similarity measure for the world wide web, in: WI, 2012, pp. 687–693.
[23] Z. Lin, M.R. Lyu, I. King, Extending link-based algorithms for similar web pages with neighborhood structure, in: 2007 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2007, 2–5 November 2007, Silicon Valley, CA, USA, Main Conference Proceedings, 2007, pp. 263–266.
[24] Z. Lin, M.R. Lyu, I. King, Matchsim: a novel neighbor-based similarity measure with maximum neighborhood matching, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2–6, 2009, pp. 1613–1616.
[25] D. Lizorkin, P. Velikhov, M.N. Grinev, D. Turdakov, Accuracy estimate and optimization techniques for simrank computation, Proc. VLDB Endow. 1 (1) (2008) 422–433.
[26] D. Lizorkin, P. Velikhov, M.N. Grinev, D. Turdakov, Accuracy estimate and optimization techniques for simrank computation, VLDB J. 19 (1) (2010) 45–66.
[27] M. Moricz, Y. Dosbayev, M. Berlyant, Pymk: friend recommendation at myspace, in: SIGMOD, 2010, pp. 999–1002.
[28] J.-Y. Pan, H.-J. Yang, C. Faloutsos, P. Duygulu, Automatic multimedia cross-modal correlation discovery, in: KDD, 2004, pp. 653–658.
[29] D. Sánchez, M. Batet, A semantic similarity method based on information content exploiting multiple ontologies, Expert Syst. Appl. 40 (4) (2013) 1393–1399.
[30] D. Sánchez, M. Batet, D. Isern, A. Valls, Ontology-based semantic similarity: a new feature-based approach, Expert Syst. Appl. 39 (9) (2012) 7718–7728.
[31] H.G. Small, Co-citation in the scientific literature: a new measure of the relationship between two documents, J. Am. Soc. Inf. Sci. 24 (4) (1973) 265–269.
[32] Y. Sun, J. Han, X. Yan, P.S. Yu, T. Wu, Pathsim: meta path-based top-k similarity search in heterogeneous information networks, PVLDB 4 (11) (2011) 992–1003.
[33] Y. Sun, Y. Yu, J. Han, Ranking-based clustering of heterogeneous information networks with star network schema, in: KDD, 2009, pp. 797–806.
[34] W. Xi, E.A. Fox, W. Fan, B. Zhang, Z. Chen, J. Yan, D. Zhuang, Simfusion: measuring similarity using unified relationship matrix, in: SIGIR, 2005, pp. 130–137.
[35] X. Yin, J. Han, P.S. Yu, Linkclus: efficient clustering via heterogeneous semantic links, in: VLDB, 2006, pp. 427–438.
[36] W. Yu, X. Lin, Irwr: incremental random walk with restart, in: SIGIR, 2013, pp. 1017–1020.
[37] W. Yu, X. Lin, W. Zhang, L. Chang, J. Pei, More is simpler: effectively and efficiently assessing node-pair similarities based on hyperlinks, PVLDB 7 (1) (2013) 13–24.
[38] M. Zhang, Z. He, H. Hu, W. Wang, E-rank: a structural-based similarity measure in social networks, in: WI, 2012, pp. 415–422.
[39] M. Zhang, H. Hu, Z. He, W. Wang, Top-k similarity search in heterogeneous information networks with x-star network schema, in: Expert Systems with Applications, 2015, http://dx.doi.org/10.1016/j.eswa.2014.08.039.
[40] P. Zhao, J. Han, Y. Sun, P-rank: a comprehensive structural similarity measure over information networks, in: CIKM, 2009, pp. 553–562.
[41] X. Zhao, C. Xiao, X. Lin, Q. Liu, W. Zhang, A partition-based approach to structure similarity search, PVLDB 7 (3) (2013) 169–180.
Mingxi Zhang is currently a lecturer in University of Shanghai for Science and Technology, Shanghai, China. He received his Ph.D. of computer software and theory from Fudan University in 2013. His current research interests include social network analysis, information retrieval and graph mining.

Hao Hu is currently a Ph.D. candidate in Fudan University, Shanghai, China. He received his B.S. degree of Information Security from Fudan University in 2010. His current research interests include keyword search and knowledge.

Zhenying He is an associate professor in School of Computer Science, Fudan University, China. He received Ph.D. of computer science from Harbin Institute of Technology. His current research interests include data mining, data management and database theory.

Liping Gao (1980-) graduated from Fudan University, China with a PhD in 2009 in Computer Science. She received her BSc and master degree in Computer Science from Shandong Normal University, China in 2002 and 2005 respectively. She is doing her research work in University of Shanghai for Science and Technology as an assistant professor. Her current research interests include CSCW, heterogeneous collaboration, consistency maintenance and collaborative engineering.

Liujie Sun is a professor at University of Shanghai for Science and Technology, Shanghai, China. He received his B.S. in the Huazhong University of Science and Technology (1986), Wuhan, China; M.S. in University of Shanghai for Science and Technology (1989), Shanghai, China; Ph.D. in University of Shanghai for Science and Technology (2008), Shanghai, China. His main research interests are on media data management and multimedia technology.