
Utility-Driven Graph Summarization

K. Ashwin Kumar Petros Efstathopoulos


Symantec Research Labs Symantec Research Labs
ashwin kayyoor@symantec.com petros efstathopoulos@symantec.com

ABSTRACT

A lot of the large datasets analyzed today represent graphs. In many real-world applications, summarizing large graphs is beneficial (or necessary) so as to reduce a graph’s size and, thus, achieve a number of benefits, including but not limited to 1) significant speed-up for graph algorithms, 2) graph storage space reduction, 3) faster network transmission, 4) improved data privacy, 5) more effective graph visualization, etc. During the summarization process, potentially useful information is removed from the graph (nodes and edges are removed or transformed). Consequently, one important problem with graph summarization is that, although it reduces the size of the input graph, it also reduces its utility. The key question that we pose in this paper is: can we summarize and compress a graph while ensuring that its utility or usefulness does not drop below a certain user-specified utility threshold?

We explore this question and propose a novel iterative utility-driven graph summarization approach. During iterative summarization, we incrementally keep track of the utility of the graph summary. This enables a user to query a graph summary that is conditioned on a user-specified utility value. We present both exhaustive and scalable approaches for implementing our proposed solution. Our experimental results on real-world graph datasets show the effectiveness of our proposed approach. Finally, through multiple real-world applications we demonstrate the practicality of our notion of utility of the computed graph summary.

PVLDB Reference Format:
K. Ashwin Kumar, Petros Efstathopoulos. Utility-Driven Graph Summarization. PVLDB, 12(4): 335-347, 2018.
DOI: https://doi.org/10.14778/3297753.3297755

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. 4
ISSN 2150-8097.
DOI: https://doi.org/10.14778/3297753.3297755

1. INTRODUCTION

A lot of the vast amounts of information we are producing and analyzing today can be represented as graphs. This fact becomes clear if one considers all the real-life data networks that can be abstractly perceived as nodes connected by edges: social networks, financial transaction networks, communication networks, citation networks, parcel shipment data, protein-protein interaction networks, gene regulatory networks, disease transmission networks, ecological food networks, sensor networks, just to name a few. The size of such graphs is growing at an unprecedented rate, spanning millions and billions of nodes and edges. For instance, Google stores more than 1 trillion indexed pages that contain billions of incoming and outgoing links. Similarly, Facebook has 800 million active users and related network data. At the current rate of data volume increase, it is becoming highly impractical to store, process, analyze, and visualize these big graphs. Therefore, in order to make graph data management, processing and visualization tractable, summarization techniques are becoming increasingly important.

There is a plethora of benefits to employing graph summarization methods. First, given the planetary scale of real-world graphs [18], graph summarization helps in reducing the size of the graph, thereby reducing the on-disk storage footprint. The reduced graph can also be loaded directly into memory to improve the performance of analytics algorithms [25]. Second, many graph algorithms that are otherwise too complex or costly to run on larger graphs can be efficiently executed on summary graphs, with adequately accurate results [16]. Third, most real-world graphs suffer from a “small world” effect which makes them look too tangled to be effectively visualized and interpreted, resulting in the “hairball” graph phenomenon. Graph summarization makes them simpler to visualize on a small screen, in turn helping with better analysis of these graphs [33, 5, 20, 14, 3]. Finally, when the original data is privacy sensitive, graph summarization may help conceal private information [12], thus enabling privacy-preserving analytics, especially among multiple mutually-distrustful parties.

A key challenge with graph summarization is that it can have a severe impact on the amount of “useful information” represented by the graph for the task at hand, i.e., the utility of the graph. Furthermore, it is difficult to predict the reduction in utility a graph will suffer when summarized. Ideally, we should be able to estimate the utility at each summarization step so that the obtained graph summary meets a user-specified utility threshold. To the best of our knowledge, state-of-the-art graph summarization approaches [23, 25] focus primarily on minimizing graph reconstruction error, and largely ignore the utility aspect, where the relative importance of nodes and edges should be considered during the summarization process. To address this gap, we pose the following key question:

Can we summarize a graph and compress it as much as possible, while ensuring that its utility does not drop below a user-defined utility threshold?

In other words, we desire a graph summarization system that permits a user to query a graph summary with a given utility. To achieve this, our summarization algorithm must be able to keep track of the utility of the graph at each step of the summarization process. Moreover, we need utility estimation to be inexpensive and yet faithfully represent certain important properties of the underlying graph which we want to retain in the computed graph summary.

In our effort to achieve these goals, we evaluated various graph summarization techniques that have been proposed. In the sparsification approach edges are filtered based on certain criteria to simplify

the underlying graph. On the other hand, the sampling approach performs sampling of a subset of nodes or edges so as to form a simpler representation of the original graph. The most popular approaches, however, are different variants of the grouping approach, which employ meaningful grouping of nodes into supernodes and edges into superedges to compute a graph summary. Grouping approaches owe their popularity to the fact that they are expressive enough to allow a user to logically explain the computed graph summary with respect to the original underlying graph. Moreover, iterative grouping approaches allow us to record the list of corrections made across the iterations, which can help us to reconstruct the exact original graph, or an approximate version of it, from the summary if needed. This, in turn, helps with provenance and explainability, where one can explain the steps taken to reach a particular summary for a given graph (useful for forensics, anomaly detection, etc.). Also, the iterative nature of the algorithm (grouping of nodes into supernodes and edges into superedges and vice versa) enables meaningful visualization and complex analysis during the summarization process. Therefore, for all the benefits it provides, we specifically focus on iterative grouping-based graph summarization approaches. However, since grouping-based graph summarization with minimum reconstruction error is shown to be NP-Hard [35], it is common to use heuristics and approximations to implement such algorithms.

In this paper, we propose a novel utility-driven graph summarization (UDS) technique, where graph utility is incrementally computed while iteratively performing the summarization. This allows us to obtain a summary with a user-specified utility threshold, thus offering the benefits of summarization while providing utility guarantees. Our contributions in this work are as follows:
1. We introduce a new framework to measure the utility of a graph while it is being perturbed by the deletion of existing edges or the addition of spurious edges. Furthermore, we judiciously extend it to compute utility for graph summaries.
2. We present a theoretical result showing the computational intractability of the UDS problem for obtaining a near-optimal solution.
3. We introduce a novel UDS algorithm that iteratively summarizes a given graph by employing an objective function that maximizes the utility at each step of the transformation. Also, during iterative summarization of the graph, UDS incrementally computes and keeps track of the running utility value.
4. We improve scalability by orders of magnitude by proposing a memoization-based approach for UDS.
5. We conduct a comprehensive experimental study using several real datasets and applications, and the results demonstrate that UDS is capable of generating high-utility graph summaries.

The rest of the paper is organized as follows: In Section 2, we present the relevant background and the different concepts discussed in this paper. In Section 3, we present the formal definition of utility, describe the set of properties and conditions that a desirable utility metric should satisfy and introduce a generic framework to estimate the utility of a perturbed graph given its base graph. We present our UDS approach in Section 4.1. We describe how we use memoization to improve the scalability of our technique for UDS in Section 4.2. In Section 5 we present experimental results evaluating the efficiency and effectiveness of UDS. Finally, we present related work in Section 6 and conclude in Section 7.

2. PRELIMINARIES

In this section, we present the background for graph summarization and the different concepts discussed in this paper.

Graph Summary. Given a graph $G = (V, E)$, its graph summary is $G_S = (V_S, E_S)$, where $V_S = \{S_1, S_2, \ldots, S_k\}$ is a set of supernodes such that $k < |V|$. If $u \in V, v \in V$, then $S_u$ represents the supernode containing node $u$ and $S_{uv}$ represents the supernode containing both the nodes $u$ and $v$. Essentially, $V_S$ consists of disjoint sets (supernodes) of nodes in $V$ such that $V = \bigcup_{i=1}^{k} S_i$ and $S_i \cap S_j = \emptyset$ ($\forall i \neq j$). In $G_S$, the edges $E_{N_{S_i}} \subset E$ connecting the set of nodes $N_{S_i}$ belonging to a particular supernode $S_i$ are not maintained; only edges connecting individual supernodes are maintained. Also, if supernodes $S_i$ and $S_j$ are connected with a superedge, then $A_{i,j}$ represents the actual cross edges connecting the nodes in $S_i$ and $S_j$. On the other hand, $\Pi_{i,j}$ denotes the bipartite graph connecting the nodes in supernodes $S_i$ and $S_j$, where $S_i, S_j \in V_S$. Alternative notations for $A_{i,j}$ and $\Pi_{i,j}$ that we use in this paper are $A_{S_u,S_v}$ and $\Pi_{S_u,S_v}$, where $u \in V, v \in V$. Also, in this work, we assume undirected, unweighted and edge-unlabeled graphs.

Reduction in Nodes (RN). We evaluate the effectiveness of our proposed techniques under varying RN. Formally, $RN = \frac{|V| - |V_S|}{|V|}$, where a value of 0.2 means that the summary has 20% fewer nodes than the original graph.

Zero Loss Encoding Transformations. We define certain encoding transformations (as shown in Figure 1) used to represent a group of nodes and edges in graph $G$ with supernodes and superedges in a summarized graph $G_S$ without loss of information. Rule 1: a group of nodes that are not connected to each other in the graph $G$ is simply represented by a supernode without a self-loop. Rule 2: a group of nodes that form a clique in graph $G$ is represented by a supernode (with a self-loop). Rule 3: if there is an all-to-all connection between two sets of nodes, then they are represented by two supernodes connected with a single superedge. “Zero loss” in this context means that if we apply these transformations in reverse order on a graph summary, then we should be able to obtain the original graph without needing any additional information or corrections. Note that in this context, zero loss also implies 100% utility because the transformations are able to preserve all the salient regions of $G$. We make use of these transformations during summarization and calculation of utility (Sections 3 and 4.1).

[Figure 1: Examples of three encoding rules for zero-loss summarization]

Utility (EU). The utility $0 \le EU \le 1$ of any graph $G_S$ that is obtained by transforming a graph $G$ indicates the usefulness of $G_S$ with respect to $G$. The higher the extent to which important regions in $G$ are preserved in the transformed graph, the greater the utility.

Example of Utility-Driven Graph Summarization (UDS). Let us consider an example. Figure 2 presents iterations of a desirable UDS system. We envision a summarization system that reports at each iteration the current EU and RN values of the graph summary $G_S$. Figure 2 offers the values for EU and RN, whose calculation is discussed in detail in the coming sections. The input graph is shown in Figure (2a). The user provides a utility threshold $\Gamma_U$ as a predicate to the UDS system, indicating that the summary $G_S$ should have utility no less than $\Gamma_U$. In this example, let’s say $\Gamma_U$ equals 0.9. Figures (2b) to (2h) show the first seven iterations of graph summarization with varying EU and RN values along the way. At every iteration, a pair of nodes is selected and collapsed to form a supernode, and neighboring edges are adjusted accordingly. The summarization system analyzes the important parts and regions of the input graph (i.e., the output of the previous iteration) and prioritizes the order in which nodes are collapsed accordingly. In every iteration, the objective is to preserve important regions of $G$ as much as possible in $G_S$. In the first iteration, two nodes are collapsed into a supernode, and edges are adjusted accordingly.

(a) Input graph (b) EU:1.0, RN:0.06 (c) EU:1.0, RN:0.13 (d) EU:0.99, RN:0.19 (e) EU:0.98, RN:0.25 (f) EU:0.98, RN:0.3 (g) EU:0.95, RN:0.38 (h) EU:0.92, RN:0.44
Figure 2: Example output of utility-driven graph summarization
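As a worked illustration of the RN definition and of how a utility threshold selects a summary, the following minimal sketch replays the EU values listed in Figure 2. The node count of 16 is an assumption inferred from RN = 0.06 after a single merge (1/16 ≈ 0.06); the helper names `reduction_in_nodes` and `select_iteration` are hypothetical and not part of the paper.

```python
# Sketch only: replays the EU/RN values reported in Figure 2 (b)-(h).

def reduction_in_nodes(num_nodes, num_supernodes):
    """RN = (|V| - |V_S|) / |V|."""
    return (num_nodes - num_supernodes) / num_nodes

def select_iteration(eu_history, gamma_u):
    """Return the last iteration (1-based) whose utility is still >= gamma_u.

    Assumes EU never increases across iterations, so we stop at the
    first iteration that falls below the threshold.
    """
    last_ok = 0
    for k, eu in enumerate(eu_history, start=1):
        if eu < gamma_u:
            break
        last_ok = k
    return last_ok

NUM_NODES = 16  # assumption: inferred from RN = 0.06 after one merge
FIG2_EU = [1.0, 1.0, 0.99, 0.98, 0.98, 0.95, 0.92]   # panels (b)-(h)
FIG2_RN = [0.06, 0.13, 0.19, 0.25, 0.3, 0.38, 0.44]  # panels (b)-(h)

# Each iteration merges one pair, so k iterations leave |V| - k supernodes.
rn_values = [reduction_in_nodes(NUM_NODES, NUM_NODES - k) for k in range(1, 8)]

if __name__ == "__main__":
    for k, (eu, rn) in enumerate(zip(FIG2_EU, rn_values), start=1):
        print(f"iteration {k}: EU={eu:.2f} RN={rn:.4f}")
    print("chosen iteration for threshold 0.9:", select_iteration(FIG2_EU, 0.9))
```

Under the 16-node assumption the computed RN values agree with the figure up to rounding, and a stricter threshold simply selects an earlier, less compressed iteration.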

Note that, in this iteration, the EU value remains 1.0 because we can still reconstruct $G$ from $G_S$ by simply applying the decoding rules of Figure 1. Also, RN = 0.06 as the number of nodes in $G_S$ is reduced by 1. By the end of the second iteration, EU remains 1. In the third iteration, however, EU is reduced to 0.99, since reconstructing $G$ from the $G_S$ produced in this step will introduce spurious edges. The reduction in EU is 0.01, based on the extent to which important regions are affected in $G$. Similarly, all iterations from 4 to 7 cause a drop in EU. Note that the reduction in EU from (2d) to (2e) is less than that from (2f) to (2g). This is because the merge step at (2e) preserves important regions better than the merge step at (2g). This will be explained in detail in the coming sections. Overall, the algorithm terminates at the seventh iteration (2h) as any attempt to further summarize the graph would cause the EU to drop below the user-specified threshold $\Gamma_U = 0.9$. Finally, the computed graph summary $G_S$ (in Figure (2h)) is presented to the user as the output.

3. UTILITY OF A GRAPH SUMMARY

The fulcrum of this work is our proposed method for calculating the utility of a graph summary with respect to an underlying graph. We approach this problem by attempting to reconstruct the original graph $G$ from a summary $G_S$ with no extra information. For reconstruction, we apply the reverse of the transformations discussed in Section 2. This can result in the loss of original edges as well as the introduction of spurious edges. Supernodes with self-loops are expanded into a clique of their contained nodes, otherwise they are expanded into disconnected nodes. A pair of sets of base nodes forms a bipartite graph if the corresponding supernodes are connected by a superedge, otherwise they are completely disconnected. More formally, given $G_S$ of graph $G$, we reconstruct the graph $G' = (V', E')$ from $G_S$ such that $V = V'$. The number of nodes and the node set in $G'$ are equivalent to those of $G$, although the number of edges might vary, primarily due to the error introduced by graph summarization. Figure 3 presents an example of a graph $G$, its summarization $G_S$, and the graph $G'$ which is reconstructed from $G_S$ by applying the rules shown in Figure 1 in the reverse order.

[Figure 3: Example of a graph $G$, its summary $G_S$, and reconstruction $G'$]

Once $G_S$ is transformed into a reconstructed graph $G'$, the problem of calculating the utility of a graph summary is reduced to the problem of calculating the utility of $G'$ with respect to $G$, using a utility function denoted by $EU(G')_G$. In essence, when there is greater structural similarity between $G$ and the reconstructed graph $G'$ (i.e., the extent to which important edges and regions in $G$ are preserved in $G'$), the utility of $G_S$ is higher. The reconstructed graph $G'$ obtained from $G_S$ is equivalent to a graph $G'$ obtained by perturbing $G$ (by adding certain spurious edges, or removing original edges, or both). Therefore, from now on we will call the reconstructed graph the perturbed graph. Next, we present a generic framework to calculate the utility of $G'$ with respect to $G$.

Generic Framework for a Graph Utility Function. Our key intuition is to penalize the utility of graph $G'$ in accordance with the introduced perturbations. The amount of cost or penalty should be based on the importance of edges that are missing, or the number of spurious edges introduced, or both. An intuitive way to assess the relative importance of edges in the original graph $G$ is by computing normalized edge centrality scores edgeIS. If $\{E - E'\}$ is the set of edges missing from $G'$ compared to $G$’s original edges, then the utility of $G'$ is penalized by the sum of relative importance scores of the missing edges. Next, we should penalize $G'$’s utility according to any spurious edges it contains that did not exist in $G$. We do this by calculating the proportion of spurious edges introduced in $G'$ to the total number of spurious edges possible in the base graph $G$. More formally, the maximum number of spurious edges that can be introduced in $G$ is $\binom{|V|}{2} - |E|$. If $\{E' - E\}$ is the set of spurious edges introduced in $G'$, and assuming homogeneity, then for each spurious edge the utility $EU(G')_G$ is penalized by the amount $\frac{1}{\binom{|V|}{2} - |E|}$.

Algorithm 1 Generic Graph Utility Function (GGUF)
procedure GGUF($G = (V, E)$, $G' = (V, E')$)
    utility = 1.0
    edgeIS = normalize(edge centrality scores($G$))
    if $G \neq \emptyset$ and $G' \neq \emptyset$ then
        for $e \in \{E - E'\}$ do
            utility = utility − edgeIS[e]
        end for
        for $e \in \{E' - E\}$ do
            penalty = $\frac{1}{\binom{|V|}{2} - |E|}$
            if penalty < utility then
                utility = utility − penalty
            else
                utility = 0
            end if
        end for
    end if
    return utility
end procedure

The value of utility is in the range [0,1]. Given a non-empty and non-clique graph $G$, there are four notable conditions under which the utility of $G'$ is zero: 1) if $G'$ is an empty graph, 2) if $G'$ is a clique, 3) if $G'$ is missing all the original edges, and 4) if $G'$ contains all the possible spurious edges. Pseudocode for the generic graph utility function GGUF is shown in Algorithm 1. Without loss of generality, it can be easily extended to weighted graphs where penalties will be weight adjusted. Moreover, the generic nature of GGUF allows us to plug in a variety of centrality metrics to form different types of utility functions, each exhibiting different properties. Next, we

identify certain intuitive properties a utility function should exhibit and discuss how to assess its desirability.

Assessing the Desirability of a Utility Function. To make a utility function aware of the important regions of $G$ that are preserved in $G'$, we use a set of fairly intuitive properties described in Table 1 that a desirable graph utility metric should exhibit. The key motivation in defining these properties and imposing necessary conditions for a desirable utility metric is that the maximization of such a utility metric during summarization should help maintain the results of important graph algorithms, such as ranking and community detection. To further explain these properties and test the desirability of a utility function, we use example model graphs shown in Figure 4 with various shapes and varying numbers of missing edges, such as: clique, path, cycle, barbell, wheel barbell, etc. Note that these examples are not exhaustive and are only meant to explain the key concepts. Also, it is not necessary that a desirable utility function exhibits all the listed properties in conjunction; it is only required to exhibit each property independently. We present an example test criterion 3.1 that uses the shown model graphs to test if a utility function exhibits the desired property, in this case criterion C1.

Table 1: Properties of a desirable Graph Utility Function
Criteria | Properties | Description
C1 | Edge Importance | Changes that create disconnected components or weaken the connectivity should be penalized more than the changes that maintain the connectivity properties of the graphs.
C2 | Spurious Edge Awareness | More spurious edges must lead to lower utility.
C3 | Weight Awareness | In weighted graphs, the higher the weight of the removed edge or added spurious edge is, the greater the impact on the similarity measure should be.
C4 | Edge Submodularity | A specific change is more important in a graph with fewer edges than in a much denser graph.

Example Test Criteria 3.1 Consider barbell graphs $B_n$, $mB_n$ and $mmB_n$ to explain C1: edge importance criterion. Graph $B_n$ has two cliques of size $n_1$ and $n_2$, such that $n = n_1 + n_2$. Graph $mB_n$ has an edge removed from one of the cliques in $B_n$, whereas graph $mmB_n$ has a missing bridge edge from $B_n$. In this case, according to edge importance criterion C1, the following should hold:

$\left(EU(mB_n)_{B_n} - EU(mmB_n)_{B_n}\right) > 0$    (1)

[Figure 4: Model synthetic graphs used to validate utility function. $K_n$: clique of size n, $P_n$: path of size n, $C_n$: cycle of size n, $L_n$: lollipop of size n, $B_n$: barbell of size n, $WhB_n$: wheel barbell of size n, $mX$: missing X edges, and $mmX$: missing X “bridge” edges. Panels: K5, C5, B10, L10, WhB12; mK5, mC5, mB10, mL10, mWhB12; mmK5, m2C5, mmB10, mmL10, m2WhB12; w5B10, w2B10, mmWhB12, mm2WhB12; BC11, mBC11, mmBC11, G9]

Similarly, additional example test criteria are presented in Section 5.1 to test if a utility function exhibits the remaining desired properties. Also, in Section 5.2 we present experimental results where we try various centrality metrics in GGUF and provide guidelines for the right set of centrality metrics to be plugged in, so as to create a utility function that exhibits the properties of Table 1.

Discussion. Although the calculation of utility $EU(G')_G$ and of structural similarity through simple graph edit distance (GED) between $G'$ and $G$ seem similar, they differ in significant ways. GED essentially counts the number of edges that differ between the original graph and the graph reconstructed from the graph summary. GED does not differentiate between non-important and important regions in the graph as GGUF does. Moreover, in simple GED, the cost of edit operations is fixed, whereas in our case the cost of edits is dynamic and depends on the structure of the original graph. Also, simple GED violates certain key properties that our utility function satisfies. For example, consider a graph $G = (V, E)$ with $|E| = \binom{|V|}{2} - 1$ edges and let $G' = (V, E')$ be its perturbed graph that is a clique with $|E'| = \binom{|V|}{2}$ edges. Then the utility of $G'$ with respect to $G$ is zero (lowest) according to GGUF (Algorithm 1), but a simple GED would calculate a utility value > 0, where utility is calculated as $1 - \frac{d}{|E|}$ and $d$ is the number of edits or distance. Intuitively, a utility value of zero is desirable in this case, because if $G$ is a non-clique and non-empty, then no matter how dense $G$ is, if $G'$ is a clique, then essentially $G'$ does not reveal any information with respect to $G$, thus rendering its utility equal to zero. We note that simple GED violates all the desired properties of an ideal utility function (Table 1) except C2, whereas GGUF, when plugged with an appropriate centrality metric, satisfies all the properties. We have included an experiment in Table 5 to demonstrate this. We also note that the simplicity of GGUF permits us to easily extend it so as to incrementally calculate the utility of $G'$ while it is being perturbed. In this case, we start with a utility of 1.0 that represents $G' = G$. As we perturb $G'$ by deleting edges, adding spurious edges, or both, we penalize the utility accordingly by subtracting the appropriate cost. Similarly, we incrementally calculate the utility of $G_S$ at each iteration, by analyzing the possible perturbations without actually generating the reconstructed graph at each summary step.

4. UTILITY-DRIVEN SUMMARIZATION

We begin our discussion by presenting the mathematical formulation of our problem. Given a graph $G = (V, E)$ and utility threshold $\Gamma_U$, we want to summarize the graph $G$ as much as possible by grouping nodes into a minimum number of supernodes $V_S$ and forming superedges $E_S$ between supernodes such that the difference between the total utility of retained actual edges and the total penalty of introduced spurious edges is at least, and ideally very close to, the given $\Gamma_U$. Initially, each node in the original graph is its own supernode in the summary graph.

minimize $(|V_S|)$    (2)

Subject to

$\sum_{S_i} \sum_{\substack{S_j \\ (S_i, S_j) \in E_S}} \left( \sum_{e \in A_{i,j}} edgeIS[e] - \frac{|\Pi_{i,j} - A_{i,j}|}{\binom{|V|}{2} - |E|} \right) \ge \Gamma_U$    (3)

Since the problem of graph summarization is shown to be NP-Hard [35], one may be interested in obtaining a partition that is a p-approximation for some p > 1. However, a computational intractability result for obtaining a near-optimal partition can be established as follows.

Theorem 4.1 [No Efficient Approximation Theorem] For any $\varepsilon > 0$, there is no $O(n^{1-\varepsilon})$-approximation for the problem of obtaining a feasible graph summarization with a minimum number of supernodes for a given utility threshold, unless NP = ZPP (footnote 1).

Footnote 1: This intractability result is based on the widely believed assumption that complexity classes NP and ZPP are different [29].

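To make the utility computation and the constraint of Equation (3) concrete, the following is a minimal sketch of Algorithm 1 together with the reverse transformations of Section 3, so that a candidate summary can be checked against a threshold by expanding it into $G'$. It makes several assumptions that are not from the paper: edge importance defaults to a uniform score that sums to 1 (the paper plugs in centrality metrics instead), the function names `gguf`, `reconstruct`, `ged_utility` and `summary_utility` are hypothetical, and graphs are plain edge sets over comparable node labels.

```python
from itertools import combinations
from math import comb

def norm_edge(u, v):
    """Canonical form of an undirected edge."""
    return (u, v) if u <= v else (v, u)

def gguf(nodes, edges, edges_prime, edge_is=None):
    """Utility of G' = (V, E') with respect to G = (V, E), after Algorithm 1."""
    E = {norm_edge(u, v) for u, v in edges}
    E_prime = {norm_edge(u, v) for u, v in edges_prime}
    if edge_is is None:
        # Assumption: uniform importance summing to 1 (stand-in for a
        # normalized edge centrality metric).
        edge_is = {e: 1.0 / len(E) for e in E}
    utility = 1.0
    for e in E - E_prime:               # missing original edges
        utility -= edge_is[e]
    max_spurious = comb(len(nodes), 2) - len(E)
    for _ in E_prime - E:               # spurious edges
        penalty = 1.0 / max_spurious
        utility = utility - penalty if penalty < utility else 0.0
    return max(utility, 0.0)            # guard against float round-off

def reconstruct(supernodes, superedges):
    """Expand a summary into E'. supernodes: id -> set of nodes;
    superedges: set of (id, id) pairs, where (i, i) is a self-loop."""
    edges = set()
    for i, j in superedges:
        if i == j:   # self-loop expands to a clique
            edges |= {norm_edge(u, v)
                      for u, v in combinations(sorted(supernodes[i]), 2)}
        else:        # superedge expands to an all-to-all connection
            edges |= {norm_edge(u, v)
                      for u in supernodes[i] for v in supernodes[j]}
    return edges

def ged_utility(edges, edges_prime):
    """Simple GED-based utility 1 - d/|E| from the Discussion."""
    E = {norm_edge(u, v) for u, v in edges}
    E_prime = {norm_edge(u, v) for u, v in edges_prime}
    return 1.0 - len(E ^ E_prime) / len(E)

def summary_utility(nodes, edges, supernodes, superedges):
    """Utility of a summary, obtained by expanding it and applying gguf."""
    return gguf(nodes, edges, reconstruct(supernodes, superedges))

if __name__ == "__main__":
    nodes = list(range(5))
    clique = set(combinations(nodes, 2))
    almost_clique = clique - {(0, 1)}   # |E| = C(5,2) - 1
    print("GGUF:", gguf(nodes, almost_clique, clique))
    print("GED :", ged_utility(almost_clique, clique))
```

The demo reproduces the clique example from the Discussion: collapsing an almost-complete graph into a single self-looped supernode yields a clique on reconstruction, which GGUF scores as zero while the GED-based measure stays positive. A real implementation would update the utility incrementally rather than rebuild $G'$ at every step, as the paper describes.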
PROOF. Davidson et al. [4] have proved that for the problem of in the summarized graph GS . All edge connections between nodes
obtaining a feasible clustering with a minimum number of clusters in G are maintained between corresponding supernodes in GS . At
under cannot-link (CL) constraints if, for some ε > 0, there exists each algorithm step, we pick from H the node pair (u, v) with the
O(n1−ε )-approximation for the feasibility problem then that would lowest importance score. Unless nodes u and v belong to the same
imply NP = ZPP. Here, CL constraints involve data points (that supernode Su = Sv , their corresponding supernodes Su and Sv are
are required to be) in different clusters. Following this result, we collapsed into supernode Suv . Let VSuv indicate the set of nodes in
directly reduce the problem of obtaining a feasible clustering with a G belonging to a particular supernode Suv . We calculate the set of
minimum number of clusters to our problem to prove the result. potential neighbors ηSuv of Suv in GS by finding the set of 1-hop
Given a set of data points D = {d1 , d2 , ..., |V |} . Let Ei, j be the neighbors of all nodes belonging to VSuv in G and by calculating their
measure of distance between data points where 0 < Ei, j ≤ 1 repre- corresponding supernodes in GS . For every unique potential neigh-
sents the points that are relatively closer to each other and Ei, j = 0 bor Sn ∈ ηSuv , where n ∈ G, we need to decide whether connecting
otherwise. Let VS be the set of clusters of data points. Initially each Suv and Sn with a superedge is beneficial for utility. We commit
data point di is its own cluster Si . It is straightforward to see that the decisions for all the potential neighbors in ascending order of
data points D with prior distance values represent a graph G = (V, E) the calculated penalties. Procedure connectSuperEdge(. . .) (pseu-
where values 0 < Ei, j ≤ 1 represent edge weighted graphs and they docode in Algorithm 3, discussed later) returns true if the given
represent edge unweighted graphs if these Ei, j takes value of 0 or pair of supernodes should be connected by a superedge, or false oth-
1. Values of Ai, j , ∏i, j and edgeIS can be calculated based on Ei, j erwise. For a pair of supernodes this procedure calculates 1) seCost:
values. Since these values are defined over the data points that are in
different clusters, constraint (Equation 3) using these values is essen-
tially formed by set of CL constraints. Objective of minimizing the Algorithm 2 Utility-Driven Graph Summarization
number of clusters |VS | for a given set of CL constraints on pair of 1: procedure UDS UMMARIZER(G = (V, E), ΓU )
data points can be directly mapped to the objective of grouping the 2: Initialize: utility = 1; VS = {u : {u} | u ∈ V }; ES = {({u}, {v}) |
nodes from the original graph into minimum number of supernodes (u, v) ∈ E}; S = {u : u | u ∈ V }
3: nodeIS, edgeIS = normalize(centrality scores(G))
subject to the set of constraints involving pairs of nodes in different 4: P2hop = {(a, c) | (a, b) ∈ E, (b, c) ∈ E}
supernodes as shown in Equations 2 and 3. Proof completes.  5: H = sort(P2hop | ↑ ( f (nodeIS[a]) + f (nodeIS[c])), ∀(a, c) ∈ P2hop )
Because it is not possible to devise a feasible or efficient approx- 6: while utility ≥ ΓU and H ≠ ∅ do
imation algorithm for the problem at hand, we instead rely on 7: (u, v) = H.pop()
greedy heuristics that make a best effort at each step taken. 8: if Su ≠ Sv then
9: Suv = {Su ∪ Sv }
4.1 Iterative Greedy UDS 10: VS = {VS ∪ Suv } − {Su , Sv }
11: ηSuv = {Sb ∈ VS | b ∈ ηa , ∀a ∈ Suv } − {Su ∪ Sv }
We present a novel iterative greedy UDS algorithm with an incre- 12: for Sn ∈ ηSuv do
mental utility update. Our primary goal is to summarize the given 13: bool, penalty = connectSuperEdge(Suv , Sn , G, edgeIS)
graph G so as to compress it to an extent such that the utility of the 14: ηSuv [Sn ].connect = bool
summary graph GS does not drop below a user-specified threshold 15: ηSuv [Sn ].penalty = penalty
ΓU . To compose our algorithm we need to determine the following 16: end for
17: ηSuv = sort(ηSuv | ↑ (penalty))
steps, based on principles presented in the previous Section: 1) intro-
18: for Sn ∈ ηSuv do
duce a strategy for grouping nodes, 2) find an iterative, utility-driven 19: if ηSuv [Sn ].connect is true then
summarization recipe, 3) come up with appropriate superedge con- 20: ES = {ES ∪ (Suv , Sn )} − {(Sn , Su ), (Sn , Sv )}
nectivity criteria, 4) present techniques to incrementally keep track 21: end if
of utility, 5) optimize the algorithm’s performance and scalability. 22: utility = utility − ηSuv [Sn ].penalty
Prioritizing Candidates to Merge. One way to prioritize the merg- 23: return GS if utility < ΓU
24: end for
ing of nodes is by considering edge importance. The goal is to pick 25: connect, penalty = connectSuperEdge(Suv , Suv , G, edgeIS)
an edge e with the lowest importance and merge the nodes u and v at 26: if connect is true then
e’s end-points so as to form a supernode w. However, this approach 27: ES = ES ∪ (Suv , Suv )
completely forgoes the benefit of merging nodes that are indirectly 28: end if
connected to each other. Many a times, collapsing nodes that are 29: utility = utility − penalty
not directly connected and forming appropriate superedges might 30: end if
31: end while
result in higher utility. For example, it is often beneficial to collapse 32: return GS = (VS , ES )
nodes that have many common neighbors [25]—even not directly 33: end procedure
connected. Therefore, at each step, we consider pairs of nodes that
are both 1) directly connected by an edge, or 2) indirectly (2-hop)
connected via common neighbors, as candidates to form supern- the penalty to connect them with a superedge, and 2) nseCost: the
odes. Given a list of both 1-hop and 2-hop connected node pairs, penalty to not connect them. If connectivity is deemed beneficial—
we seek to prioritize or sort this list in ascending order of impor- i.e., seCost < nseCost—then Suv and Sn are connected through a
tance (denoted by ↑). We calculate the normalized node centrality superedge. Subsequently, the utility is updated by subtracting the
scores nodeIS for the nodes in the base graph and then calculate corresponding penalty values and all the previous connections be-
the combined importance score for a node pair p =< u, v > as the tween (Sn , Su ) and (Sn , Sv ) are removed from GS . This particular
sum of function of normalized centrality scores of the nodes—given way of connectivity decision making guides the summarization al-
by ( f (nodeIS[u]) + f (nodeIS[v])). In our implementation we use a gorithm so as to maximize the utility of GS at each summarization
square function as f () as it helps in further delaying the merging of step. Similarly, the decision to self-connect a supernode Suv or not
important nodes with relatively lesser ones. Let H be the list of node is made based on utility maximization: if a self-loop is deemed
pairs sorted by their combined importance scores. Also, let edgeIS beneficial, then a self-connection (Suv , Suv ) is added to the set of
be the map that maps each edge to its importance score. An edge superedges ES . In the next iteration, the node pair from H with the
importance score is calculated as the normalized edge centrality. next lowest importance score is evaluated. The algorithm terminates
Iterative Greedy Summarization. As shown in Algorithm 2, we and returns the final GS when the current utility of GS satisfies the
initially map each node in the base graph G to a unique supernode utility threshold ΓU or when all node pairs have been evaluated.

339
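The candidate prioritization described under "Prioritizing Candidates to Merge" can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the helper names (`candidate_pairs`, `prioritize`), the toy graph, and the use of normalized degree as a stand-in for the centrality score nodeIS are all assumptions made for the example.

```python
from itertools import combinations

def candidate_pairs(adj):
    """Collect node pairs that are 1-hop or 2-hop connected (undirected graph)."""
    pairs = set()
    for u, nbrs in adj.items():
        for v in nbrs:
            if u != v:
                pairs.add(frozenset((u, v)))      # 1-hop: direct edge
        for a, c in combinations(sorted(nbrs), 2):
            pairs.add(frozenset((a, c)))          # 2-hop: a and c share neighbor u
    return pairs

def prioritize(adj, node_is, f=lambda x: x * x):
    """Sort candidate pairs in ascending order of f(nodeIS[a]) + f(nodeIS[c])."""
    scored = []
    for pair in candidate_pairs(adj):
        a, c = sorted(pair)
        scored.append((f(node_is[a]) + f(node_is[c]), a, c))
    return sorted(scored)

# Toy graph: triangle a-b-c with a tail c-d-e (hypothetical example data).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
# Assumption: normalized degree stands in for the normalized centrality score.
max_deg = max(len(nbrs) for nbrs in adj.values())
node_is = {u: len(nbrs) / max_deg for u, nbrs in adj.items()}

H = prioritize(adj, node_is)
for score, a, c in H:
    print(f"{a}-{c}: {score:.3f}")
```

In this toy graph the pair (d, e) sits at the head of H because both nodes have low scores, while every pair involving the hub c is pushed towards the end; squaring the scores widens that gap, which is the delaying effect the text attributes to the square function.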
Superedge Connectivity Decision Making. Let us discuss the details of the procedure connectSuperEdge(...), as shown in Algorithm 3. As mentioned before, this procedure returns true if connecting two given supernodes by a superedge is beneficial in terms of utility, or false otherwise. The benefit is defined as the minimum penalty that is paid (lost utility) when a particular action is performed. In our case, there are two possible cases to evaluate: 1) connecting two supernodes S_u and S_v by a superedge, (S_u, S_v) ∈ E_S, and 2) not connecting the supernodes, (S_u, S_v) ∉ E_S. Note that the two supernodes in question can be the same (see line 25, Algorithm 2); in this case, we evaluate the action of self-connecting the given supernode with a superedge (self-loop). Let's understand the implications of each of the actions below:

• Case 1: (S_u, S_v) ∈ E_S. When two given supernodes S_u and S_v are connected by a superedge, it induces an all-to-all connection Π_{S_u,S_v} between the sets of base nodes contained in S_u and S_v (per the encoding rules of Figure 1). Consequently, apart from the original cross edges A_{S_u,S_v} ⊂ E in G, we are introducing an additional set of spurious edges {Π_{S_u,S_v} − A_{S_u,S_v}} between the sets of nodes contained in S_u and S_v. Essentially, at this step, reconstruction of G from the current G_S (as discussed in Section 3) would introduce |Π_{S_u,S_v} − A_{S_u,S_v}| spurious edges as a result of the current action. Additionally, we know that for each introduced spurious edge the utility is penalized by an amount 1 / ((|V| choose 2) − |E|). Let seCost be the total penalty or cost associated with the action of connecting supernodes S_u and S_v.

• Case 2: (S_u, S_v) ∉ E_S. We know that if supernodes S_u and S_v are not connected, then we are missing the set A_{S_u,S_v} of original edges. In other words, reconstruction of G from the current G_S would delete |A_{S_u,S_v}| edges that existed in G. Since the importance score of each edge e in G is given by edgeIS[e], for each missing edge e the utility has to be penalized by an amount of edgeIS[e]. Let nseCost be the total penalty associated with the action of not connecting supernodes S_u and S_v.

Finally, if seCost > nseCost, then the benefit of not connecting the given supernodes S_u and S_v is higher, and vice versa.²

Algorithm 3 Utility-Driven Superedge Connectivity Decision Maker and Incremental Utility Calculator
 1: procedure CONNECTSUPEREDGE(V_Sw, V_Sn, G, edgeIS)
 2:   Initialize: penalty = 0; seCost = 0; nseCost = 0; decision = false; CF = (cap, bSize, fSize); cf_se^+ = ∅; cf_se^- = ∅; cf_nse^+ = ∅; cf_nse^- = ∅
 3:   totalSE = (|V| choose 2) − |E|
 4:   for u ∈ V_Sw do
 5:     for v ∈ V_Sn do
 6:       if u ≠ v and (u, v) not seen before then
 7:         e = (u, v)
 8:         if e ∈ E and e ∈ CF then
 9:           seCost = seCost − edgeIS[e]
10:           cf_se^- = cf_se^- ∪ e
11:         else if e ∉ E and e ∈ CF then
12:           nseCost = nseCost − 1/totalSE
13:           cf_nse^- = cf_nse^- ∪ e
14:         else if e ∈ E and e ∉ CF then
15:           nseCost = nseCost + edgeIS[e]
16:           cf_nse^+ = cf_nse^+ ∪ e
17:         else if e ∉ E and e ∉ CF then
18:           seCost = seCost + 1/totalSE
19:           cf_se^+ = cf_se^+ ∪ e
20:         end if
21:       end if
22:     end for
23:   end for
24:   if seCost < nseCost then
25:     penalty = seCost
26:     CF.insert((u, v)), for all (u, v) ∈ cf_se^+
27:     CF.delete((u, v)), for all (u, v) ∈ cf_se^-
28:     decision = true
29:   else
30:     penalty = nseCost
31:     CF.insert((u, v)), for all (u, v) ∈ cf_nse^+
32:     CF.delete((u, v)), for all (u, v) ∈ cf_nse^-
33:     decision = false
34:   end if
35:   return (decision, penalty)
36: end procedure
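To make the four-way case analysis of Algorithm 3 concrete, here is a small Python sketch of the decision procedure. It is only an illustration under stated assumptions: a plain Python `set` named `processed` stands in for the cuckoo filter (so there are no false positives), edges are keyed by `frozenset`, and the function name, argument list, and toy data are hypothetical rather than taken from the authors' code.

```python
from math import comb

def connect_super_edge(s_w, s_n, edge_is, n_nodes, processed):
    """Return (decision, penalty) for linking supernodes s_w and s_n.

    edge_is   : dict mapping frozenset({u, v}) -> importance of an original edge
    processed : set of node pairs already penalized (stand-in for the cuckoo filter)
    """
    total_se = comb(n_nodes, 2) - len(edge_is)   # number of possible spurious edges
    se_cost = nse_cost = 0.0
    se_add, se_del, nse_add, nse_del = set(), set(), set(), set()
    seen = set()
    for u in sorted(s_w):
        for v in sorted(s_n):
            e = frozenset((u, v))
            if u == v or e in seen:
                continue
            seen.add(e)
            is_edge, done = e in edge_is, e in processed
            if is_edge and done:            # original edge penalized earlier: roll back
                se_cost -= edge_is[e]
                se_del.add(e)
            elif not is_edge and done:      # spurious edge penalized earlier: roll back
                nse_cost -= 1.0 / total_se
                nse_del.add(e)
            elif is_edge:                   # original edge that would go missing
                nse_cost += edge_is[e]
                nse_add.add(e)
            else:                           # spurious edge that would be introduced
                se_cost += 1.0 / total_se
                se_add.add(e)
    if se_cost < nse_cost:
        processed.update(se_add)
        processed.difference_update(se_del)
        return True, se_cost
    processed.update(nse_add)
    processed.difference_update(nse_del)
    return False, nse_cost

# Toy data: 4 base nodes, 2 original edges (importance values are made up).
edge_is = {frozenset(("a", "c")): 0.6, frozenset(("a", "b")): 0.4}
processed = set()
first = connect_super_edge({"a", "b"}, {"c", "d"}, edge_is, 4, processed)
second = connect_super_edge({"a", "b"}, {"c", "d"}, edge_is, 4, processed)
print(first, second)
```

In the toy run, connecting would add three spurious edges (cost 3 × 1/4) while not connecting loses one edge of importance 0.6, so the first call decides against the superedge and records the missing edge. The second call on the same pair then reports a penalty of zero, which is the "no redundant penalization" behaviour that the bookkeeping is meant to provide.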
² Note that our algorithm can be modified slightly to provide k-anonymity guarantees [12] under favorable conditions. A supernode comprising k nodes will be k-anonymous, and the supernode comprising the minimum number of original nodes can be considered an anonymity lower bound.

Incremental Utility Calculation. To accurately calculate the utility at each iteration in an incremental fashion, we need to keep track of all actions and related penalties that have been imposed in previous iterations. This explicit bookkeeping avoids redundant penalization of the utility at each iteration. For example, let's say we are evaluating the action of connecting two supernodes S_u and S_v by a superedge. Performing this action equates to the introduction of one or more spurious edges in the underlying graph between the sets of base nodes contained in S_u and S_v. In principle, we must penalize the utility for the introduced spurious edges. However, it may be the case that in previous summarization steps, the utility has already been penalized for spurious edges that we are considering in the current step. Thus, we need to keep track of spurious edges that we have penalized the utility for at each iteration. On the other hand, when we are evaluating (S_u, S_v) ∈ E_S, we need not penalize for the original cross edges A_{S_u,S_v} between S_u and S_v. However, in previous iterations, some finalized action might have penalized the utility for some or all of these original edges ⊂ E. Thus, we need to roll back the penalty of these edges in the current action. This indicates that we need to keep track of original edges as well as spurious edges that we might have penalized the utility for in previous iterations. Accordingly, the amount of bookkeeping needed is in the order of O((|V| choose 2)). This large space requirement makes it impractical to use any kind of deterministic data structure (list, hash table, hash set, etc.) for the purposes of bookkeeping. Instead, we need a more space-efficient data structure to keep track of processed edges.

Probabilistic Data Structures to the Rescue. A Bloom filter is a potential option as it is a space-efficient data structure that can be used to keep track of already processed edges. Processed edges marked in the Bloom filter indicate that the utility has been (potentially) penalized for these edges. As discussed before, certain penalties for already processed edges often need to be rolled back. This implies that these edges should be deleted from the Bloom filter in such situations. Unfortunately, standard Bloom filters do not support deletion of items. Certain variants, such as the counting Bloom filter, allow both addition and deletion of items, but with significant space overhead. In fact, counting Bloom filters [9, 8] are known to use 3–4× the space to retain the same false positive rate as a space-optimized Bloom filter. Fan et al. [8] introduced Cuckoo filters (CF). CF possess the dual advantage of space efficiency as well as the ability to handle deletion of items. Given their advantages, we make use of CF to manage the bookkeeping of processed edges and corresponding rollbacks.

Over-Optimism in Utility. We know that probabilistic data structures suffer from the problem of false positives, i.e., they may identify an item as a set member even though it is not. Cuckoo filters allow the false positive rate to be controlled by varying the capacity and fingerprint size [8]. Because of false positives introduced by CF, there is a possibility of unwarranted optimism in the calculation of utility. From Algorithm 3, we know that cf_se^- is the set of original edges already processed in previous steps as confirmed by CF, and cf_nse^+ is the set of original edges that are yet to be evaluated. Whereas, cf_nse^- is the set of spurious edges already evaluated in previous steps as confirmed by CF, and cf_se^+ is the set of spurious edges yet to be processed. We analyze two specific cases.

In the case where (S_u, S_v) ∈ E_S, we connect the given supernodes S_u and S_v with a superedge. This action introduces spurious edges between the nodes in the given supernodes. We denote this set of spurious edges as {Π_{S_u,S_v} − A_{S_u,S_v}}. The set of spurious edges cf_se^+ that are yet to be evaluated is calculated as {Π_{S_u,S_v} − A_{S_u,S_v} − cf_nse^-}. We want to penalize the utility for extra spurious edges that have been unprocessed in previous iterations. In addition, we need to roll back penalties for the original cross edges that were processed in previous iterations for which the utility has already been penalized. The total penalty seCost is calculated by subtracting the total cost of edges in cf_se^- from the total cost of edges in cf_se^+:

$$ seCost = \frac{|cf_{se}^{+}|}{\binom{|V|}{2} - |E|} - \sum_{e \in cf_{se}^{-}} edgeIS[e] \tag{4} $$

The current utility at this step is calculated as utility = utility − seCost.

Theorem 4.2 If fpr is the false positive rate of CF and if (S_u, S_v) ∈ E_S, then we have an upper bound on the utility over-estimation δ_se, where

$$ \delta_{se} \le \frac{|\Pi_{S_u,S_v} - A_{S_u,S_v} - cf_{nse}^{-}|}{\binom{|V|}{2} - |E|} \times \frac{fpr}{1 - fpr} \tag{5} $$

PROOF. By the definition of the false positive rate, we know that

$$ fpr = \frac{|\text{false positives}|}{|\text{false positives}| + |\text{true negatives}|} $$

From this we can derive an expression for false positives in terms of fpr and true negatives. Also, let det_se^+ be the set of spurious edges that are yet to be evaluated, and det_se^- be the set of original edges already processed in previous steps, as confirmed by a deterministic data structure (e.g., a hash table). We know that cf_se^+ and cf_se^- are calculated based on a probabilistic data structure, in our case, a Cuckoo filter. Therefore, the utility over-estimation is the difference between seCost calculated based on the deterministic and probabilistic data structures.

$$ \delta_{se} = seCost^{\Delta} = \frac{|det_{se}^{+} - cf_{se}^{+}|}{\binom{|V|}{2} - |E|} - \sum_{e \in \{det_{se}^{-} - cf_{se}^{-}\}} edgeIS[e] $$

To find the upper bound, we need to find the maximum value of seCost^Δ, or equivalently minimize the sum of edgeIS[e] over e ∈ {det_se^- − cf_se^-}. We know that we need at least one edge between the nodes in S_u and S_v to connect these supernodes with a superedge. Let's consider a single edge connecting S_u and S_v and let ε be the importance score of this edge. For a given original graph of large size, the value of ε can be close to zero and we can safely ignore it. So we have:

$$ \delta_{se} \le \frac{|det_{se}^{+} - cf_{se}^{+}|}{\binom{|V|}{2} - |E|} - \varepsilon \approx \frac{|det_{se}^{+} - cf_{se}^{+}|}{\binom{|V|}{2} - |E|} \le \frac{|\text{false positives}|}{\binom{|V|}{2} - |E|} = \frac{|\text{true negatives}|}{\binom{|V|}{2} - |E|} \times \frac{fpr}{1 - fpr} $$

Here, true negatives is nothing but the set of spurious edges yet to be evaluated (i.e., cf_se^+), and we know that cf_se^+ = {Π_{S_u,S_v} − A_{S_u,S_v} − cf_nse^-}. Thus, we have an upper bound for the utility over-estimation. □

Similarly, in the case of (S_u, S_v) ∉ E_S, we can calculate the utility over-estimation by analyzing the cost of not connecting any S_u, S_v.

In summary, use of CF for the purpose of incremental utility calculation can result in over-optimism because of false positives. However, with careful selection of the capacity and fingerprint size of the CF, fpr can be made sufficiently small. Subsequently, utility over-estimation becomes almost negligible. Essentially, increasing capacity improves the occupancy of a cuckoo hash table, whereas increasing fingerprint (hash) size rejects more false queries, thereby reducing fpr but with the caveat of increased space overhead.

Time Complexity Analysis. Since the calculation of importance scores (Algorithm 2, line 3) depends on the choice of underlying centrality algorithm, we will focus on the time complexity of the iterative node merging algorithm (lines 6–32). In each merge step, for each potential neighbor of the merged supernode (O(d_av) neighbors), we evaluate connectivity between the merged supernode and that neighbor, which costs O(|V|^2). Therefore, the overall complexity of each merge step comes out to O(|V|^2 d_av), where d_av is the average degree.

Limitations. The key limitation of Algorithm 2 is that it does not scale well for large graphs. This is because node merging and superedge decision making (lines 6–32) are exhaustive in nature and perform redundant computations. For example, consider Figure 5(a), which shows a portion of the base graph where nodes a, b, and c are more densely connected to nodes 1, 2, 3, and 4 in comparison to node set e, f, g. Figure 5(b) shows an iteration of graph summarization where three supernodes S_1 = {a, b, c}, S_2 = {e, f, g} and S_3 = {1, 2, 3, 4} are formed. In this iteration, supernodes S_1 and S_2 are evaluated against S_3 for connectivity. A total of 12 comparisons (denoted by com(S_1, S_3)) are made to decide connectivity between S_1 and S_3, and 12 comparisons are performed for S_2 and S_3. Also, 3 comparisons each are made to decide self-connectivity for supernodes S_1 and S_2. So in total 30 comparisons are made for the case shown in Figure 5(b). However, in the next iteration (Figure 5(c)), we are merging S_1 and S_2 to form supernode w. In order to evaluate connectivity between w and S_3 we perform 24 redundant comparisons between the nodes contained in supernode w and nodes in S_3 that have already been performed in the previous iteration. Even to decide the self-connectivity of w, many (9) redundant computations are performed. In total, we count 33 redundant comparisons that could have been avoided if we were to reuse previous computations. This insight leads us to a more efficient approach, discussed next.

Figure 5: Example illustrating redundant computations. (a) Portion of original graph, (b) Portion of graph summary showing superedge decision making between supernodes (S_1, S_3), (S_2, S_3) and self-connections, (c) Portion of graph summary showing superedge decision making between supernodes (w, S_3) and self-connections.

4.2 Memoization based Approach

To overcome scalability challenges, we introduce a memoization technique as a scalable approach to UDS. The key goal is to compute graph summaries and perform incremental utility calculation by reusing previous computations. Initially, each node and edge in the base graph G is its own supernode and superedge in the summary graph G_S. We start by defining three variables for each superedge (S_a, S_b) ∈ E_S in G_S: seCost(S_a, S_b), nseCost(S_a, S_b) and (S_a, S_b)_exist. Because S_a and S_b are already connected, the value of seCost(S_a, S_b) is initialized to 0 (for all superedges). Also, initially
when S_a = {a} and S_b = {b}, deciding not to connect a superedge between supernodes S_a and S_b incurs a cost of edgeIS[(S_a, S_b)]. Therefore, nseCost for all superedges is initialized to the corresponding edgeIS[e] values. Whereas, (S_a, S_b)_exist indicates if a given superedge is permanent (value of 1) or ephemeral (value of 0). An ephemeral superedge indicates that we have decided not to connect the two given supernodes based on the result of a superedge decision-making process, while a permanent superedge indicates the opposite. The key advantage of an ephemeral superedge is that it provides a low-cost way to store calculated penalty costs for both connecting and not connecting a particular superedge. Although an ephemeral superedge is not considered a real edge, it helps us judiciously re-use the pre-computed penalty costs stored in it for upcoming cost computations. Initially, all the superedges are permanent; therefore, the value of (S_a, S_b)_exist for all edges (S_a, S_b) ∈ E_S is set to 1. Initialization of all the superedge variables with the required conditions is shown in Equation 6.

$$ \left. \begin{aligned} seCost(S_a, S_b) &= 0 \\ nseCost(S_a, S_b) &= edgeIS[(S_a, S_b)] \\ (S_a, S_b)_{exist} &= 1 \end{aligned} \right\} \ \text{if } (a, b) \in E,\ S_a = \{a\},\ S_b = \{b\},\ (S_a, S_b) \in E_S \tag{6} $$

After initialization, in the upcoming iterations, connectivity costs seCost and nseCost can be calculated by reusing costs calculated in previous iterations, as shown in Equations 7 and 8. For instance, let's say at iteration t we are evaluating connectivity between supernodes S_u and S_w and calculate utility penalty costs seCost(S_u, S_w) and nseCost(S_u, S_w). Let's say we decided not to connect S_u and S_w because nseCost is less than seCost. At this point, the current utility is calculated as utility = utility − nseCost(S_u, S_w). So in the summary graph we connect an ephemeral edge between the given supernodes and set (S_u, S_w)_exist = 0. In a future iteration t + k, where k ≥ 1, if we want to calculate the cost to connect supernodes S_u and S_w, then we need to nullify the previously subtracted penalty for disconnecting the given supernodes in iteration t. More formally, seCost(S_u, S_w) at iteration t + k is calculated by reusing previous computations as seCost(S_u, S_w)_{t+k} = seCost(S_u, S_w)_t − nseCost(S_u, S_w)_t. However, if supernodes S_u and S_w were never evaluated before for connectivity, then seCost is calculated by estimating the penalty for introducing spurious edges across the nodes contained in the given supernodes. In a similar way, nseCost is calculated by reusing previously computed values, as shown in Equation 8.

$$ seCost(S_u, S_w) = \begin{cases} seCost(S_u, S_w) - nseCost(S_u, S_w) & \text{if } (S_u, S_w) \in E_S,\ (S_u, S_w)_{exist} = 0 \\ \dfrac{|S_u| \times |S_w|}{\binom{|V|}{2} - |E|} & \text{if } (S_u, S_w) \notin E_S \end{cases} \tag{7} $$

$$ nseCost(S_u, S_w) = nseCost(S_u, S_w) - seCost(S_u, S_w) \quad \text{if } (S_u, S_w) \in E_S,\ (S_u, S_w)_{exist} = 1 \tag{8} $$

Given the values of penalties calculated in the previous iterations for supernode pairs (S_u, S_w) and (S_v, S_w), we calculate utility penalties for supernode pair (S_uv, S_w) using Equations 9 and 10. Here, S_uv is the supernode obtained by merging supernodes S_u and S_v. For example, the seCost of connecting a merged supernode S_uv with an existing supernode S_w is calculated by adding the individual costs of (S_u, S_w) and (S_v, S_w). Similarly, we compute nseCost of (S_uv, S_w) by reusing the individual costs of (S_u, S_w) and (S_v, S_w).

$$ seCost(S_{uv}, S_w) = seCost(S_u, S_w) + seCost(S_v, S_w) \tag{9} $$

$$ nseCost(S_{uv}, S_w) = nseCost(S_u, S_w) + nseCost(S_v, S_w) \tag{10} $$

Once we have the individual penalty costs of evaluating connectivity between merged supernode S_uv and its potential neighbors, then the total penalty cost of merging any two supernodes S_u and S_v is calculated by summing the corresponding individual costs of evaluating connectivity of S_uv with its potential neighbors. Equations 11 and 12 show the calculations.

$$ seCost(S_{uv}) = \sum_{\substack{w \in \eta_m \\ \forall m \in \{S_u \cup S_v\}}} seCost(S_{uv}, S_w) \tag{11} $$

$$ nseCost(S_{uv}) = \sum_{\substack{w \in \eta_m \\ \forall m \in \{S_u \cup S_v\}}} nseCost(S_{uv}, S_w) \tag{12} $$

Finally, the utility penalty or costs associated with merged supernode S_uv's self-connectivity decision making can also be calculated by adding the pre-computed costs of S_u's self-connectivity (S_u, S_u), S_v's self-connectivity (S_v, S_v), and the cost associated with evaluating supernode pair (S_u, S_v). Calculations are shown in Equations 13 and 14.

$$ seCost(S_{uv}, S_{uv}) = seCost(S_u, S_v) + seCost(S_u, S_u) + seCost(S_v, S_v) \tag{13} $$

$$ nseCost(S_{uv}, S_{uv}) = nseCost(S_u, S_v) + nseCost(S_u, S_u) + nseCost(S_v, S_v) \tag{14} $$

In summary, given a newly merged supernode S_uv and its potential neighbor S_w, to evaluate connectivity between them, we reuse previous computations between S_u, S_w and S_v, S_w as opposed to redundantly performing comparisons between base nodes contained in S_uv and S_w, as done in the previous approach (Section 4.1). As shown in Algorithm 2 (lines 12–16), we do this for all potential neighbors. As a result, by avoiding redundant computations, we have effectively reduced the complexity of each merge step from O(|V|^2 d_av) in the previous approach to O(d_av) in the current approach. Also, by storing penalty costs (seCost and nseCost) for each superedge and using the concept of ephemeral edges, we have introduced an extremely low-overhead way to keep track of the penalty costs for all the pairs of supernodes, whether we decided to connect them or not. These stored penalty costs are used to efficiently calculate costs for upcoming computations.

Discussion: While memoization reduces the time complexity of each merging step, the time complexity of computing the importance scores can still be high. We improve the performance of this step by making use of fast approximation algorithms for centrality calculation. For example, considering a betweenness centrality based utility function, we make use of an approach that uses random sampling of shortest paths to estimate centrality values for all the nodes/edges [32]. The algorithm runs in the order of O(|E|) per sample and, interestingly, the number of samples needed to compute a good approximation for all vertices is a constant and independent of G.

Finally, we note that the techniques proposed in this paper are not limited to undirected and unweighted graphs. For instance, calculation of importance scores can be easily adapted to directed/weighted graphs, as centrality computing algorithms exist for directed, weighted graphs as well. On the other hand, in the grouping step, node pair candidates to merge at each step can also be picked based on directions. For instance, if in a directed graph we have directed edges (a → b) and (a → c), then b and c can be one such candidate pair to merge. Also, since our utility function depends on the calculation of importance scores for nodes and edges, it naturally adapts to weighted graphs.

5. EXPERIMENTAL EVALUATION

5.1 Experimental Settings

Setup. We perform all our experiments on a single Amazon EC2 m4.4xlarge instance with 16 vCPU, 64 GB memory, and 300 GB SSD storage. We use Python and create graphs using the Networkx [26] library. For certain scalable centrality implementations we rely on the networkit [34] library. To scale to the large datasets that barely fit in memory, we made several programmatic improvements to our code. For example, we carefully parallelized the loop
(lines 12–16, Algorithm 2). Also, we modified the Networkx library to support external memory graph access (read and write). Specifically, we extend Networkx by subclassing the Graph class and providing user-defined factory functions. These functions query a database and cache the results in the dictionaries used by Networkx.

Datasets. In our experiments we make use of seven real-world undirected and unlabeled graph datasets. Among them, ca-GrQc, ca-AstroPh, ca-HepTh, and ca-HepPh are author collaboration networks from the e-print arXiv for the General Relativity, Astrophysics, High Energy Physics Theory, and High Energy Physics categories, respectively. In the com-Amazon dataset, two products are connected if they are co-purchased. LiveJournal and Friendster are online blogging and gaming networks, respectively. All the datasets can be downloaded from [19]. Table 2 presents the datasets and their properties such as size, average degree (Avg. Deg.), density, average clustering coefficient (Cl. Co.), number of connected components (CCs), and size of largest component (LC).

Table 2: Real world graph datasets
Dataset           Nodes      Edges        Avg. Deg.   Density     Cl. Co.   CCs     LC
ca-GrQc           5242       14496        5.526       0.1054%     0.5296    355     79.32%
ca-HepTh          9877       25998        5.259       0.0533%     0.4714    429     87.46%
ca-HepPh          12008      118521       19.73       0.1644%     0.6114    278     93.30%
ca-AstroPh        18772      198110       21.10       0.1124%     0.6306    290     95.37%
com-amazon        334863     925872       5.529       0.0017%     0.3967    1       100.00%
com-LiveJournal   4036538    34681189     17.18       0.0004%     0.2815    38577   99.04%
com-Friendster    65608366   1806067135   55          8.4e-05%    0.1623    1       100%

Table 3: Example test criteria
Desirable Property: Example Test Criteria

C2: Let's consider graphs m_X C_n, m_Y C_n, and m_Z C_n (X < Y < Z) from Figure 4. Let m_Z C_n be the base graph and m_X C_n and m_Y C_n be perturbed graphs obtained by introducing X and Y spurious edges to m_Z C_n. Then, according to C2 (spurious edge awareness criterion), the following condition should be satisfied:

$$ EU(m_X C_n)_{m_Z C_n} - EU(m_Y C_n)_{m_Z C_n} > 0 \tag{15} $$

C3: Consider weighted barbell graphs w_s B_n, w_t B_n and m B_n. Here w_s B_n is a barbell graph of size n in which exactly one edge has weight s and the rest of the edges have weight r, where s > r. In this case, let m B_n be a barbell graph with a removed heavy-weighted edge. If s > t, then according to C3 (weight awareness criterion), the following should be satisfied:

$$ EU(w_t B_n)_{m B_n} - EU(w_s B_n)_{m B_n} > 0 \tag{16} $$

C4: Consider graphs K_n, m K_n and C_n, m C_n from Figure 4. These four graphs are equally sized in terms of number of nodes, where C_n has relatively fewer edges when compared to K_n. Graph m K_n is obtained by removing a single edge from K_n; similarly, m C_n is obtained by removing a single edge from C_n. Then, according to C4 (edge submodularity criterion):

$$ EU(m K_n)_{K_n} - EU(m C_n)_{C_n} > 0 \tag{17} $$
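The external-memory access mentioned in the Setup paragraph (dictionaries that query a database and cache the results) could look roughly like the sketch below. It is an illustration only, not the authors' code: the class name `LazyAdjacency`, the `edges(src, dst)` table, and the in-memory SQLite database are assumptions, and the sketch uses a plain `dict` subclass instead of wiring it into Networkx through the Graph class's dictionary factories.

```python
import sqlite3

class LazyAdjacency(dict):
    """Adjacency map that loads a node's neighbors from a database on first access."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn

    def __missing__(self, node):
        # Called by dict only when `node` has not been cached yet.
        rows = self.conn.execute(
            "SELECT dst FROM edges WHERE src = ? "
            "UNION SELECT src FROM edges WHERE dst = ?",
            (node, node),
        ).fetchall()
        nbrs = {row[0]: {} for row in rows}   # neighbor -> empty edge-attribute dict
        self[node] = nbrs                     # cache: one query per node
        return nbrs

# Hypothetical edge table held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("a", "c"), ("b", "c")],
)

adj = LazyAdjacency(conn)
cached_before = len(adj)            # nothing has been loaded yet
neighbors_of_a = sorted(adj["a"])   # triggers one query, then caches the result
print(cached_before, neighbors_of_a, len(adj))
```

Only the nodes that are actually touched are pulled into memory, so the in-memory footprint follows the working set of the algorithm rather than the size of the whole graph.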
then the utility of GS is defined as:
1
Baselines. We closely study two key works in literature that provide ∑v∈Vk |Sv |
Top-k Query App Utility = (18)
iterative solutions for grouping-based greedy graph summarization. k
First is the work by Navlakha et al. [25] and second is by Tian et In other words, if all the top k or t% nodes from G match exactly
al. [23]. Because [23] builds on [25] and provides the distributed with top-k nodes in GS then the utility score in this case equals 1
solution for it, we implement algorithm discussed in [25] as a base- where each node contributes 1 to the summation in the numerator for
line. We have added experimental results in Section 5.2 (Figure Equation 18 as for that node |Sv | = 1. On the other hand, in the case
7) comparing our results with state-of-the-art grouping-based sum- where some of the top-k nodes are contained within a supernode
marization technique by Navlakha et al. [25]. According to this containing more than one nodes, then each such node u contributes
technique, the best pair of nodes is selected at each step on the basis a value of |S1 | . This fraction (that is < 1) represents the information
of maximum gain. Gain is defined as the extent of compression u

achieved when the selected pair of nodes are merged. To scale loss caused by the summarization process.
this technique, authors select a node u at random and a neighbor v • Application 2: Link Prediction. Another real-world application
within 2-hops is selected that achieves maximum gain when merged is knowing if a given pair of nodes belongs to the same community,
with u. This is repeated until the required compression is achieved. or not. In other words, based on the current community structure,
Since this technique is based on the theory of Minimum Descriptive predicting if there will be a link between the given pair of nodes, or
Length, we refer to this technique as MDL in our experiments. Next, not. To measure the utility of GS , we consider a list of all pairs of
we highlight the key design decisions that we made in our technique 2-hop nodes in graph G. For each pair, we predict a link if the pair
and replace each design decision with its random counterpart to belongs to the same community in GS , and we compare the result
create our other set of baselines. We make two key design decisions with the link prediction on G. More formally, if LS is the binary link
in our technique; first, we compute relative importance scores for prediction result vector for GS , where each element corresponds to a
nodes and edges using the shortest path betweenness centrality met- link prediction result for a pair belonging to all 2-hop pairs, and if L
ric. Second, we select a pair of 2-hop neighbors in the ascending be the result vector, then utility of GS is defined as:
order of the sum of their importance scores. We randomize these |LS ∩ L|
key steps by 1) randomly assigning importance scores to nodes and Link Prediction App Utility = (19)
|L|
edges (RNEI), 2) selecting the pair of 2-hop nodes in random order
(RNPO) while assigning importance scores using betweenness cen- Example Criteria for a Desirable Utility Function to Satisfy. In
trality, and 3) performing both steps randomly (RNEI-RNPO). For addition to the example test criteria described in Section 3, here we
random baselines we report the average of ten runs. provide a list of more example criteria shown in Table 3. These ex-
Evaluation Metrics. We evaluate our techniques using two popular ample criteria based on model graphs in Figure 4 help us understand
real-world applications, measure the application-specific utility, and the properties defined in Table 1, and evaluate the desirability of
compare with our baselines. For each application, we define a utility a utility function. Note that these criteria are not exhaustive and
metric that will indicate the usefulness of a graph summary with other criteria can be devised using the model graphs in Figure 4.
respect to the corresponding application.
• Application 1: Top-k Query. One of the widely used real-world 5.2 Experimental Results
applications is the selection of top-k or top t% of nodes, where the Current-flow and Shortest Path Betweenness Centrality-based
goal is to rank nodes using the Pagerank algorithm and select the Utility Function Satisfies All Desired Properties. We start by eval-
top k nodes according to their ranks, in descending order. Given uating the suitability of various centrality metrics that can be used
the value of t, k is derived as k = |V | ∗ t% for G. Whereas, for GS , during the calculation of edge importance scores, and form a utility
k = |VS | ∗ t%. If we run Pagerank on both graph G and its summary function that exhibits the desired properties described in Section 3.
GS and Vt% be the set of top-k nodes in G based on Pagerank values, Generally, the relative importance of each edge in the graph G is

[Figure 6 plots omitted. Panels (a)–(u) plot Top-k Query App Utility against Reduction in Nodes (RN) for the datasets ca-GrQc, ca-HepTh, ca-HepPh, ca-AstroPh, com-Amazon, com-LiveJournal, and com-Friendster, with the top 10% (a–g), top 30% (h–n), and top 50% (o–u) of nodes selected. The remaining panels, starting at (v), plot Link Prediction App Utility against RN for the same datasets. Each panel compares UDS, RNEI, RNPO, RNEI-RNPO, and MDL.]

Figure 6: Experimental results demonstrating effectiveness of UDS design decisions
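
The randomized baselines compared in Figure 6 can be illustrated with a small sketch. It is not the authors' implementation: the adjacency-dictionary graph format, the function names, and the reading of a "2-hop pair" as two nodes that share at least one neighbor are all assumptions made here for illustration.

```python
import random
from itertools import combinations


def two_hop_pairs(adj):
    """Return all unordered node pairs that share at least one neighbor.

    Assumption: this is what a "2-hop pair" means; the paper may define
    the candidate set differently.
    """
    pairs = set()
    for neighbors in adj.values():
        for u, v in combinations(sorted(neighbors), 2):
            pairs.add((u, v))
    return sorted(pairs)


def random_importance(adj, rng):
    """RNEI-style scores: a uniform random importance per node and per edge."""
    node_score = {u: rng.random() for u in adj}
    edge_score = {(u, v): rng.random() for u in adj for v in adj[u] if u < v}
    return node_score, edge_score


def random_pair_order(adj, rng):
    """RNPO-style ordering: the candidate 2-hop pairs in random order."""
    pairs = two_hop_pairs(adj)
    rng.shuffle(pairs)
    return pairs


# Hypothetical toy graph: the path 0 - 1 - 2 - 3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
rng = random.Random(0)  # fixed seed so runs are repeatable
node_score, edge_score = random_importance(adj, rng)
merge_order = random_pair_order(adj, rng)
```

Combining both functions corresponds to the RNEI-RNPO baseline; averaging an application utility over several seeds mirrors the ten-run average reported for the random baselines.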

assessed by measuring the degree of participation of edges in communication between distinct parts of the network. This leads us to the notion of betweenness centrality. The most common betweenness centrality metric is based on shortest paths, where the centrality of an edge e is essentially an average number of shortest paths connecting all pairs of nodes in the graph that pass through edge e. There are some drawbacks with this approach. First, it takes into account only the shortest paths and ignores the slightly longer paths. Edges of such relatively longer paths are critical for communication in the network. Second, the actual number of shortest paths that lie between the source and destination is irrelevant. In our case, it is reasonable to consider the abundance and the length of all paths. Knowledge about the importance of each edge to the graph structure is enhanced when more routes are possible.
In order to take such paths into account, Current-Flow Betweenness centrality can be considered [2]. Here, the graph is imagined as a resistor network in which the edges are resistors and the nodes are junctions between resistors. Accordingly, the current-flow betweenness of an edge is the amount of current that flows through it, averaged over all source-destination pairs, when one unit of current is induced at the source and the destination (sink) is connected to the ground. Let's denote shortest-path and current-flow betweenness centrality-based graph utility functions as SP-BCU and CF-BCU.
Moreover, certain centrality metrics calculate centrality scores for nodes and cannot be directly used to calculate edge centrality, e.g., centrality metrics that are based on Pagerank, Eigenvector [27], Communicability [6], Communicability Betweenness [7], etc. Here, we treat the node centrality scores as node importance. Also, by intuition, we assign importance scores to edges based on the importance of the nodes they are connecting to, i.e., we assign an edge a high importance if it connects any two highly important nodes. Using these node-based centrality measures, we estimate an edge importance by summing up the normalized centrality scores of the pair of nodes it connects, and normalizing it. Let's denote utility functions based on these centrality metrics as PRU, EVU, COU, and CO-BCU. We compare these utility functions and evaluate their effectiveness using the model graphs (shown in Figure 4) and our ideal utility function properties. Table 6 demonstrates our evaluation results. Red cells or non-positive values indicate a violation of a corresponding property or criterion. Results show that CF-BCU and BCU obey all the formal required properties (C1-C4). Bold values represent max values that are highly discriminatory for each test criterion. We find CF-BCU to be most effective and highly discriminatory. Each row of the tables corresponds to a comparison between the similarities (or distances) of two pairs of graphs; pairs (A,B) and (A,C) for property (C1-C3); and pairs (A,B) and (C,D)

Table 4: Practicality of utility EU with respect to an application of top-k query


Application 1. Datasets, left to right: ca-GrQc, ca-HepTh, ca-HepPh, ca-AstroPh, com-Amazon, com-LiveJournal, com-Friendster (two columns per dataset: Pearson's r, then Cos. Sim.)
Top % Nodes Pearson’s r Cos. Sim. Pearson’s r Cos. Sim. Pearson’s r Cos. Sim. Pearson’s r Cos. Sim. Pearson’s r Cos. Sim. Pearson’s r Cos. Sim. Pearson’s r Cos. Sim.
10 0.9475 0.9822 0.9569 0.9939 0.9453 0.9943 0.9835 0.9976 0.9329 0.9969 0.9289 0.9743 0.9448 0.9738
20 0.9232 0.9828 0.9709 0.9947 0.9438 0.9330 0.9398 0.9965 0.9628 0.9964 0.9519 0.9127 0.9474 0.9528
30 0.9403 0.9855 0.9561 0.9936 0.9488 0.9249 0.9654 0.9930 0.9794 0.9864 0.9832 0.9287 0.9527 0.9803
40 0.9505 0.9942 0.9565 0.9969 0.9428 0.9308 0.9925 0.9921 0.9877 0.9970 0.9328 0.9267 0.9378 0.9747
50 0.9280 0.9912 0.9864 0.9925 0.9426 0.9322 0.9987 0.9869 0.9893 0.9734 0.9737 0.9725 0.9735 0.9826
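
The two agreement measures reported in Table 4, Pearson's r and cosine similarity, can be computed as in the following minimal sketch. The function names and the sample EU and application-utility values are hypothetical and are not taken from the paper; they only show the shape of the computation over a sweep of RN values.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Hypothetical utilities at RN = 0.1, 0.2, ..., 0.5 (illustrative only).
estimated_utility = [0.98, 0.95, 0.90, 0.82, 0.70]
application_utility = [0.97, 0.93, 0.91, 0.80, 0.72]

r = pearson_r(estimated_utility, application_utility)
cos_sim = cosine_similarity(estimated_utility, application_utility)
```

Pearson's r captures whether the two utilities rise and fall together, while cosine similarity is sensitive to how close they are in magnitude, which is why the two are reported side by side.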

Table 5: Practicality of EU with respect to a link prediction application (Application 2)

Dataset          Pearson's r  Cos. Sim.
ca-GrQc          0.9424       0.9657
ca-HepTh         0.9306       0.9940
ca-HepPh         0.9950       0.9999
ca-AstroPh       0.9910       0.9927
com-Amazon       0.9259       0.9863
com-LiveJournal  0.9264       0.9957
com-Friendster   0.9371       0.9873

for (C4). However, the calculation of current-flow betweenness centrality (CF-BC) is computationally intensive and does not scale even for graphs of a few thousand nodes. Hence, in our experiments we use shortest-path betweenness centrality (SP-BC), which scales well for larger graphs. Also, from our results we note that, similar to CF-BCU, the utility function SP-BCU also exhibits all the desired properties. Moreover, it has been shown in [28] that, compared to other centrality metrics, SP-BC is strongly correlated with CF-BC. Hence, it is beneficial to trade a slight loss in quality for a significant improvement in performance. Finally, we compare simple graph edit distance (GED) with our utility functions. The results in the last column of Table 6 show that GED violates all the desired properties except C2. Hence, GED is not fit to serve as a utility function.
UDS Judiciously Exploits High Compressibility of the Graphs. In the first experiment, for each dataset we vary RN from 0.1 to 1 and apply UDS to analyze the incrementally calculated utility EU. Figure 7 shows the result. We compare the UDS approach with the random 2-hop pair selection for merging at each iteration (RNPO). There are three key takeaways from this experiment. First, the non-linear relationship between EU and RN (shown in red) for all the graphs indicates relatively high compressibility of the corresponding graphs. Second, we observe that denser graphs have higher compressibility than sparse graphs. Third, UDS exploits the compressibility of graphs by preserving important regions of the graphs at relatively higher RN when compared to the UDS approach with random 2-hop pair selection. We omit results for the LiveJournal and Friendster datasets as they do not show a different result.

[Figure 7 plots omitted. Panels (a)–(e) plot Estimated Utility against Reduction in Nodes (RN) for ca-GrQc, ca-HepTh, ca-HepPh, ca-AstroPh, and com-Amazon, each comparing UDS and RNPO.]

Figure 7: UDS judiciously exploits high compressibility of Graphs

UDS Design Decisions are Effective. Next, we evaluate the effectiveness of the two key design decisions that we make: 1) calculating relative importance score for nodes and edges, and 2) the order in which a pair of nodes is selected to merge in each summarization step. We compare UDS with the baselines using two real-world applications discussed in Section 5.1. In this experiment, we vary RN and calculate the application-based utility metric with UDS and related baselines where we randomize selected key design decisions. For the top-k or t% query application, we vary the top % of nodes selected and measure the application-based utility for each dataset's summary. Figures (6a)–(6q) show that across various datasets, parameter values, and applications, UDS consistently results in graph summaries with significantly higher utility compared to baseline techniques, thus demonstrating the effectiveness of our design decisions. Figures (6a)–(6s) show results for the top-k query application. Figures (6v)–(6z) show results for the application of link prediction.
UDS Performs Well Compared to State-of-the-Art. Figure 6 shows the result of our experiments where we compare the MDL approach with UDS with respect to the top-k query and link prediction applications. Our approach consistently performs well when compared to MDL. We attribute this result to the fact that the MDL approach does not optimize for the preservation of the important regions of the graph as UDS does, and hence tends to lose summary quality.
UDS Provides Attractive Trade-Off Compared to LOPT. Theorem 1 implies that it is hard to find an approximation factor or to compare our solution empirically to the global optimum. Nonetheless, we compare UDS, where we pick the best node pair to merge in each iteration in O(1) time, with the local optimum LOPT, where O(N^2) comparisons are performed to pick the best node pair to merge at each step. We present results in Tables 7 and 8. We perform two experiments. First, we compare UDS to LOPT with respect to the reduction in summary size relative to the original size returned for a particular utility threshold. For this experiment, we generate Barabasi-Albert random graphs with parameters n (graph nodes) and p (preferential attachment). As we increase p from 1 to 5, the density of the graph increases. Table 7 shows that UDS performs very close to LOPT for sparser graphs, and for denser graphs the quality of solutions reduces by at most 25% compared to LOPT. But for the given performance (O(1) compared to O(N^2) per iteration), UDS provides an attractive trade-off compared to LOPT. Second, for a particular iteration, we compare UDS's choice of best pair to merge with LOPT's O(N^2) choices (sorted in ascending order of cost). For example, Table 8 reports 0.1 if UDS's choice is within the top 10% of LOPT's choices. Table 8 shows that across the iterations and for the random graphs (p=5, utility=0.5) of various sizes, UDS's choice is within 10% of LOPT's top choices.
UDS's Estimated Utility (EU) is Practical. We compare EU with various application-specific utility values for varying RN to assess the practicality of EU to be used as an approximation for various application-specific utility metrics. For each real-world application, we calculate the Pearson's correlation coefficient r in order to measure the strength and direction of a linear relationship between EU and the application-specific utility, for varying RN from 0.1 to 1. Since EU values are in the range [0,1], we use Cosine similarity between EU and the application-specific utility in order to measure how closely related they are in magnitude. Table 4 shows the results for the application of top-k query, where we can observe that the correlation between EU and the application-specific utility is significant, with the value of r very close to 1 in almost all cases, and p-value < 10^-4. We observe a similar result for the link prediction application, as shown in Table 5. Hence, EU is practical.
UDS Scales Near Linearly with Varying RN. Figures (8a)–(8e) demonstrate that UDS performs well across all datasets, for uniformly increasing RN. Specifically, UDS exhibits near perfect linear scalability in the case of datasets with relatively higher density and average clustering coefficient (ca-HepPh and ca-AstroPh). On the other hand, in the case of relatively sparser datasets, for RN values in the range 0.1 to 0.6 the cost of iteratively merging nodes remains relatively negligible when compared to the fixed cost of calculating node and edge importance. Overall, high correlation with the linear best fit and the R^2 values confirm our scalability conclusion.
UDS Visually Simplifies Complex Graphs with Guided EU. We also conduct visual validation of our UDS approach. For this experiment, we query graph summaries with a specified EU, rather than RN. Here, we visualize the ca-HepTh graph and its summaries for varying EU values. Figure (9a) shows the input graph. We can observe that the input is a disconnected graph with many small components and one large connected component. Figures (9b)–(9d)

Table 6: Evaluating various centrality metric-based utility functions and comparison with simple graph edit distance based utility (GED) metric

Values are EU(B)_A − EU(C)_A for test graphs A, B, C:
Criteria  A      B        C         PRU      SP-BCU   EVU     COU     CO-BCU  CF-BCU  GED
C1        B10    mB10     mmB10     0.005    0.019    0.005   0.004   0.03    0.13    0
C1        L10    mL10     mmL10     -0.009   0.04     -0.03   -0.03   0.03    0.3     0
C1        BC11   mBC11    mmBC11    -0.009   0.00002  -0.007  -0.013  -0.005  0.032   0
C1, C4    WhB12  mWhB12   mmWhB12   -0.0002  0.04     0.001   -0.006  0.02    0.063   0
C1        WhB12  m2WhB12  mm2WhB12  -0.0003  0.08     0.003   -0.013  0.041   0.127   0
C2        m2C5   mC5      C5        0.023    0.023    0.023   0.023   0.023   0.023   0.33
C2        mmK5   mK5      K5        0.5      0.5      0.5     0.5     0.5     0.5     0.125
C2, C3    mB10   w2B10    w5B10     0.095    0.095    0.095   0.095   0.095   0.095   0
C2, C3    mB10   B10      w5B10     0.132    0.132    0.132   0.132   0.132   0.132   0

Values are δ = EU(B)_A − EU(D)_C for test graphs A, B, C, D:
Criteria  A   B    C    D     PRU   BCU    EVU    COU   CO-BCU  CF-BCU  GED
C4        K5  mK5  C5   mC5   0.1   0.099  0.1    0.1   0.1     0.2     0.1
C4        C5  mC5  mC5  m2C5  0.08  0.095  0.088  0.07  0.03    0.142   0.05
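
For the node-centrality-based utility functions in Table 6 (PRU, EVU, COU, CO-BCU), the text describes edge importance as the normalized sum of the normalized centrality scores of an edge's two endpoints. The sketch below illustrates that rule under stated assumptions: normalization is taken to mean scaling scores so they sum to 1, and plain node degree stands in for Pagerank or the other centralities. Neither choice is confirmed by the paper, and the function names are hypothetical.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (assumed normalization scheme)."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def edge_importance(edges, node_centrality):
    """Edge importance from node scores: sum the endpoints, then normalize."""
    node_importance = normalize(node_centrality)
    raw = {(u, v): node_importance[u] + node_importance[v] for u, v in edges}
    return normalize(raw)


# Hypothetical toy graph; degree is a stand-in for a real node centrality.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "d")]
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

importance = edge_importance(edges, degree)
```

Under this rule, edges touching the high-degree node "b" and another well-connected node receive more importance than the edge to the leaf "a", which matches the stated intuition that an edge is important when it connects two important nodes.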

Table 7: Comparing UDS to LOPT based on summary sizes for a given utility threshold.

                                n = 1000       n = 5000       n = 10000
Pref Attachment  Utility Thr.   UDS    LOPT    UDS    LOPT    UDS    LOPT
1                0.9            0.805  0.975   0.78   0.97    0.91   0.99
1                0.7            0.90   0.97    0.96   0.97    0.94   0.99
1                0.5            0.95   0.97    0.98   0.98    0.98   0.99
3                0.9            0.30   0.40    0.43   0.52    0.48   0.57
3                0.7            0.50   0.75    0.58   0.78    0.66   0.80
3                0.5            0.72   0.92    0.79   0.93    0.77   0.92
5                0.9            0.12   0.27    0.22   0.38    0.31   0.40
5                0.7            0.36   0.55    0.47   0.69    0.51   0.67
5                0.5            0.67   0.80    0.68   0.86    0.66   0.84
[Figure 8 plots omitted. Panels (a)–(g) plot Running Time (secs) against Reduction in Nodes (RN), one panel per dataset, each annotated with a linear best fit in x = RN:]

Panel  Dataset          Linear fit               Correlation  R^2
(a)    ca-GrQc          2.4189 + 11.809*x        0.92874      0.86256
(b)    ca-HepTh         (-2.2194) + 53.761*x     0.93476      0.87377
(c)    ca-HepPh         58.544 + 57.86*x         0.99         0.98
(d)    ca-AstroPh       67.143 + 168.48*x        0.98712      0.97447
(e)    com-Amazon       (-1858.4) + 7709.2*x     0.92073      0.84775
(f)    com-LiveJournal  (-7546.7) + 40943*x      0.94697      0.89676
(g)    com-Friendster   (-374778) + 2.4227e6*x   0.95262      0.90748

Figure 8: Scalability of UDS
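
Figure 8 annotates each panel with a linear best fit, a correlation, and an R^2 value. A minimal sketch of how such quantities can be obtained with ordinary least squares follows; the function name and the timing values are hypothetical and are not the paper's measurements. For a simple one-variable linear fit, R^2 equals the square of Pearson's correlation, which is the relationship used here.

```python
import math


def linear_fit(xs, ys):
    """Least-squares line y = intercept + slope * x, plus r and R^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r, r * r


# Hypothetical running times (seconds) for RN = 0.1, 0.2, ..., 1.0.
rn_values = [i / 10 for i in range(1, 11)]
running_time = [3.1, 4.9, 6.2, 7.0, 8.6, 9.4, 10.9, 11.8, 13.5, 14.0]

intercept, slope, r, r_squared = linear_fit(rn_values, running_time)
```

A correlation close to 1 together with a high R^2 indicates that running time grows roughly linearly with RN, which is the criterion the scalability discussion relies on.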

Table 8: For a particular kth iteration, comparing UDS choice of best node pair to merge compared to LOPT top choices (UDS choice w.r.t LOPT Top % Choices).

                  k-th Iteration
Number of Nodes   10    20    30    40    50    1
1000              0.56  0.54  0.11  0.10  0.01  0.26
5000              0.84  0.71  0.54  1.31  0.86  0.36
10000             0.95  0.93  0.84  0.81  0.67  0.61

to minimize the reconstruction error, while completely ignoring the preservation of important parts and regions of the graph. Purohit et al. [30] study the diffusion and propagation processes where they propose to merge two adjacent nodes such that the coarsened graph retains its diffusive properties. They do not consider the possibility of merging nodes that are not directly connected but share majority of the neighbors. Their limitation of not considering indirectly connected nodes prevents them from exploiting various opportunities to achieve high compression. Navlakha et al. [25] propose a highly
compact two-part representation of a given graph consisting of a
graph summary and a set of corrections. The corrections portion


specifies the list of edge-corrections that helps to recreate the origi-


nal graph. Using the concept of MDL (minimum description length)
they try to create a graph summary with a minimal set of corrections.
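The summary-plus-corrections idea can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the method of any cited work: the function names `node_pairs`, `summarize`, and `reconstruct`, the simple majority rule for adding a superedge, and the toy graph are all hypothetical. The number of corrections plays the role of the reconstruction error: applying them to the expanded summary recovers the input graph exactly.

```python
from itertools import combinations, product


def node_pairs(members, a, b):
    """All candidate node pairs covered by the supernode pair (a, b)."""
    if a == b:
        return {frozenset(p) for p in combinations(sorted(members[a]), 2)}
    return {frozenset(p) for p in product(members[a], members[b])}


def summarize(edges, groups):
    """Build superedges and corrections for an undirected simple graph.

    edges  -- iterable of (u, v) node pairs
    groups -- dict mapping every node to its supernode label (a partition)
    """
    edge_set = {frozenset(e) for e in edges}
    members = {}
    for node, label in groups.items():
        members.setdefault(label, set()).add(node)

    superedges, corrections = set(), set()
    labels = sorted(members)
    for i, a in enumerate(labels):
        for b in labels[i:]:
            pairs = node_pairs(members, a, b)
            present = pairs & edge_set
            absent = pairs - edge_set
            # Assumed rule: add a superedge only when most pairs are real edges.
            if len(present) > len(absent):
                superedges.add((a, b))
                corrections |= {("-", p) for p in absent}
            else:
                corrections |= {("+", p) for p in present}
    return members, superedges, corrections


def reconstruct(members, superedges, corrections):
    """Expand the superedges, then apply the corrections."""
    edges = set()
    for a, b in superedges:
        edges |= node_pairs(members, a, b)
    for sign, pair in corrections:
        if sign == "+":
            edges.add(pair)
        else:
            edges.discard(pair)
    return edges


# Hypothetical toy input: two supernodes, X = {a, b} and Y = {c, d}.
toy_edges = [("a", "c"), ("a", "d"), ("b", "c")]
toy_groups = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
toy_members, toy_superedges, toy_corrections = summarize(toy_edges, toy_groups)
print("superedges:", toy_superedges)
print("corrections needed:", len(toy_corrections))
```

In this toy input a single superedge between the two supernodes stands in for three of the four possible cross edges, so the only correction is the removal of the one missing pair. A coarser grouping shrinks the summary but typically lengthens the correction list.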

In other words, they are trying to minimize the reconstruction error of some form. Tian et al. [23] provide a distributed-systems solution for [25]. The solution that we propose in this paper can be classified as a grouping-based method.

Most of the techniques discussed above minimize reconstruction error without considering utility maximization. Only a handful of approaches consider preserving utility, and then only in very specific scenarios. For example, Yan et al. [37] focus specifically on entity graphs with meaningful node and edge labels, and sample important nodes (entities) and relations to create a concise preview; utility is evaluated by human reviewers. We also note that none of the techniques discussed above generates graph summaries with a user-specified utility threshold. Moreover, the majority of prior work evaluates its techniques with respect to a single application and does not demonstrate the effectiveness of the summaries on more than one real-world application. Finally, we acknowledge Koutra et al.'s work [15], which partly inspired our work.

Figure 9: Visualization of graph ca-HepTh and its summaries for varying utility EU: (a) Input, (b) EU: 0.9, (c) EU: 0.5, (d) EU: 0.1. For clarity, we ignore smaller CCs for (9d).

show summaries with decreasing EU values from 0.9 to 0.1. We can observe from Figures (9b)–(9c) that, as the EU decreases, the number of CCs increases. This happens because, as RN increases by virtue of the decreasing EU, relatively less important parts of the graph are collapsed, and at some point they tend to get disconnected from the important regions because the UDS algorithm decides that maintaining connectivity in that particular step is no longer beneficial in terms of the graph's EU. For the sake of clarity, we ignore smaller CCs for Figure (9d) and show only the largest CCs. Overall, we notice that as EU decreases, the largest component of the graph gets disentangled and becomes simpler.

6. RELATED WORK

Sparsification-based methods. Shen et al. [33] developed a tool called OntoVis that simplifies the underlying graph by relying on node filtering to understand large social networks. Lin et al. [21] propose an unsupervised technique for egocentric information abstraction in heterogeneous social networks, where the key idea is to filter edges as opposed to nodes. They design criteria to distill impor-

7. CONCLUSION

We strongly believe that, given any complex, large graph, the ability to query a graph summary with a user-specified utility threshold has tremendous potential and can find applications in a variety of use cases. In this work, we present a novel approach to summarize a complex graph driven by the objective of maximizing the utility of the calculated graph summary. In doing so, we establish the
30580 23133 6792 11662 11098 11688 38868 36117 277 64621 27 62692

tant information to construct the abstracted graphs for visualization. theoretical foundations of governing the properties of an ideal utility
MZ2VX3FHVC AYCGWCXEJY PTGZQCW739 M4NGGLD3YU 3HNHW7IR58 XF2I5K04G1 X6NJTT98VJ 7VKN0L3PQ4 W4B4G62Q23 ETRYAFUU16 1FTD8UFP9Q R9027GWK1E ASCCOIXPZ3 PRLQFQQNX1 47NUADERW1 7BXEVGFTGH JDNQK6D8RQ B5I7JG4O4O 1OAWJFKLOJ LYOCE4K41G 16QCASW0EH 31AVWTN8WD IZ1J2VPE7I E16648PJ23 OUQ6UOM731 ZQODYF6AZG PAD3IFJPVL H1G02Q9YK8 MEAFP6UTKT

Sampling-based methods. Plethora of works [13, 17, 24, 11, 1, function. We make theoretical connections to well known problems
37] focus on simplifying the underlying complex graph through 2VA1JA9KJ4 CUQRG7PTZ3 52N3A3HQNH C8GAGYIXQ4 EZKXHISI8L 7630QHNPGU
to prove inapproximability of the problem at hand. Subsequently, we
PCWD5SI2AR 3XKZ3ZYTQU MQ6Y3VG72D 71210V1DGL 1GPC9I6GQ1 RM8SVPTQPG D3TSVV2K5K VTF0YQSO3C RKIYY8N7IB TJR894WSBK 95V9NXSL49 R9JLKZVQJN B4X9EKNHXN WDA3RS2S5T EUHA18JTER GEZPT8R736 6TZVRDMLD8 SH91IBCBIX Q1GPNB9SDO T907FSUS19 S0KHJPJ9TN 72Z9X27EKZ QLQBZDMPO6

sampling of nodes or edges from it. In summary, these techniques 71HY2U89G1 CWNODR0NNG ISAEG1YV4N J6DYRZA2BR 6E5S3AJJB5 O7IWJ9EW39 propose a utility-driven summarization algorithm, and supplement
SLE3ZGPKAW Z483PWJNQR IXIRKTFZRQ 8UXG082N8L 535WON747W RGVYGSSEXS ZN9OTVCDCX N9DKNZ974D 24772 R6H500JJYP 0LF3T2M5CX J1PANTGLBF RZHPNXSZ01 8QJLVYV0YF ST7VTM520B AOJZICZE82 2HH5HRTUY5 EOI01WERU6 OOJRRC8MM4 QKUH24AIGB H9H2R6LIEJ R0CPCRDXJ2 MLP2XB4ASE

estimate the properties of the original graph, estimate relative fre- M0CKBLQG6O VDWJPSI0OF D8G3GP0IM8 U0OX08C535 B7GVDD23F1 I70U0ZW6S7
it with scalable heuristics. Our iterative summarization technique
NYG951S6Z4 DMEHAHHWA3 3TC30K0O52 MSF29ND2D7 Z3VLL7UN44 ZIZUNCD71B CTI9J8SZ7F R2S9U0GVCP KRIX0CICCM 1DR7Y5N0LI EZQGUFMSED 4KTDWYHNPS LFGCOIW8YZ QTATPPLJI4 5E4HJ4LXEV 5D8TR5ACDP 2GPTNBC4WL 8C7TAZLZAA 3BY0UT5O42 TEFXU8ONMV 32415 6S9R656LJK VD3GD6P8W2

quencies of its substructures and then create a small sample subgraph allows a user to query a graph summary with a specified utility
IW2YPEGZ4Q PZK016KUXH 9K8ZPMCSGD JFAOO9T92Y UK08S5I14Q NRUBWZK0IU G7H516BH0V K8ZL1K08YB

that resembles the original graph. Also, there are techniques [10, value. Finally, our experiments and evaluation results on multiple
22] that use linear dimensionality reduction on the complex graph real-world datasets demonstrate the effectiveness of UDS both in
to generate simplified graph sketches or data synopses. terms of quality, performance, and overall practicality.
Grouping-based methods. The approaches [16, 36, 31] focus on Acknowledgement. We thank reviewers and our colleagues Sandeep
the compression problem as a selection of supernodes, superedges Bhatkar, and Matteo Dell’Amico for helpful reviews and comments.

8. REFERENCES
[1] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In International Conference on World Wide Web, WWW, pages 37–48. ACM, 2013.
[2] U. Brandes and D. Fleischer. Centrality measures based on current flow. In Conference on Theoretical Aspects of Computer Science, STACS, pages 533–544. Springer-Verlag, 2005.
[3] D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. J. Artif. Int. Res., 1(1):231–255, Feb. 1994.
[4] I. Davidson and S. S. Ravi. Intractability and clustering with constraints. In International Conference on Machine Learning, ICML, pages 201–208, New York, NY, USA, 2007. ACM.
[5] C. Dunne and B. Shneiderman. Motif simplification: Improving network visualization readability with fan, connector, and clique glyphs. In Conference on Human Factors in Computing Systems, CHI, pages 3247–3256. ACM, 2013.
[6] E. Estrada and N. Hatano. Communicability in complex networks. Phys. Rev. E, 77:036111, Mar. 2008.
[7] E. Estrada, D. J. Higham, and N. Hatano. Communicability betweenness in complex networks. 388, 05 2009.
[8] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher. Cuckoo filter: Practically better than bloom. In International Conference on Emerging Networking Experiments and Technologies, CoNEXT, pages 75–88. ACM, 2014.
[9] L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8(3):281–293, June 2000.
[10] M. Ghashami, E. Liberty, and J. M. Phillips. Efficient frequent directions algorithm for sparse matrices. In International Conference on Knowledge Discovery and Data Mining, KDD, pages 845–854. ACM, 2016.
[11] M. A. Hasan. Methods and applications of network sampling. In Optimization Challenges in Complex, Networked and Risky Systems, chapter 5, pages 115–139. 2016.
[12] M. Hay, G. Miklau, D. Jensen, D. Towsley, and C. Li. Resisting structural re-identification in anonymized social networks. The VLDB Journal, 19(6):797–823, Dec. 2010.
[13] C. Hübler, H. P. Kriegel, K. Borgwardt, and Z. Ghahramani. Metropolis algorithms for representative subgraph sampling. In International Conference on Data Mining, ICDM, pages 283–292. IEEE, 2008.
[14] D. Koutra, U. Kang, J. Vreeken, and C. Faloutsos. Summarizing and understanding large graphs. Stat. Anal. Data Min., 8(3):183–202, June 2015.
[15] D. Koutra, N. Shah, J. T. Vogelstein, B. Gallagher, and C. Faloutsos. Deltacon: Principled massive-graph similarity function with attribution. TKDD, 10(3):28:1–28:43, 2016.
[16] K. LeFevre and E. Terzi. Grass: Graph structure summarization. In SDM, pages 454–465. SIAM, 2010.
[17] J. Leskovec and C. Faloutsos. Sampling from large graphs. In International Conference on Knowledge Discovery and Data Mining, KDD, pages 631–636. ACM, 2006.
[18] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. In International Conference on World Wide Web, WWW, pages 915–924. ACM, 2008.
[19] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[20] C. Li, G. Baciu, and Y. Wang. Modulgraph: Modularity-based visualization of massive graphs. In SIGGRAPH Asia 2015 Visualization in High Performance Computing, SA, pages 11:1–11:4. ACM, 2015.
[21] C. T. Li and S. D. Lin. Egocentric information abstraction for heterogeneous social networks. In International Conference on Advances in Social Network Analysis and Mining, ASONAM, pages 255–260, 2009.
[22] E. Liberty. Simple and deterministic matrix sketching. In International Conference on Knowledge Discovery and Data Mining, KDD, pages 581–588. ACM, 2013.
[23] X. Liu, Y. Tian, Q. He, W.-C. Lee, and J. McPherson. Distributed graph summarization. In International Conference on Information and Knowledge Management, CIKM, pages 799–808. ACM, 2014.
[24] A. S. Maiya and T. Y. Berger-Wolf. Sampling community structure. In International Conference on World Wide Web, WWW, pages 701–710. ACM, 2010.
[25] S. Navlakha, R. Rastogi, and N. Shrivastava. Graph summarization with bounded error. In International Conference on Management of Data, SIGMOD, pages 419–432. ACM, 2008.
[26] NetworkX developer team. Networkx. https://networkx.github.io/, 2014.
[27] M. Newman. Networks: An Introduction. Oxford University Press, Inc., 2010.
[28] M. E. J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27(1):39–54, 2005.
[29] C. H. Papadimitriou. Computational complexity. Addison-Wesley, Reading, Massachusetts, 1994.
[30] M. Purohit, B. A. Prakash, C. Kang, Y. Zhang, and V. Subrahmanian. Fast influence-based coarsening for large networks. In International Conference on Knowledge Discovery and Data Mining, KDD, pages 1296–1305. ACM, 2014.
[31] M. Riondato, D. García-Soriano, and F. Bonchi. Graph summarization with quality guarantees. Data Min. Knowl. Discov., 31(2):314–349, Mar. 2017.
[32] M. Riondato and E. M. Kornaropoulos. Fast approximation of betweenness centrality through sampling. In International Conference on Web Search and Data Mining, WSDM, pages 413–422, New York, NY, USA, 2014. ACM.
[33] Z. Shen, K.-L. Ma, and T. Eliassi-Rad. Visual analysis of large heterogeneous social networks by semantic and structural abstraction. IEEE Transactions on Visualization and Computer Graphics, 12(6):1427–1439, Nov. 2006.
[34] C. Staudt, A. Sazonovs, and H. Meyerhenke. Networkit: An interactive tool suite for high-performance network analysis. CoRR, abs/1403.3005, 2014.
[35] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation for graph summarization. In International Conference on Management of Data, SIGMOD, pages 567–580. ACM, 2008.
[36] H. Toivonen, F. Zhou, A. Hartikainen, and A. Hinkka. Compression of weighted graphs. In International Conference on Knowledge Discovery and Data Mining, KDD, pages 965–973. ACM, 2011.
[37] N. Yan, S. Hasani, A. Asudeh, and C. Li. Generating preview tables for entity graphs. In International Conference on Management of Data, SIGMOD, pages 1797–1811. ACM, 2016.
