
Supercomputing for challenging applications, First Assignment.

Luca Barone.

Why is Graph Management Challenging?

Graph models are designed to manage data in areas where the main concern has to do with the
interconnectivity or topology of that data. In these applications, the atomic data and the relations amongst
the units of data have the same level of importance. [1]
Graphs have been used to represent data sets in a wide range of application domains, such as social
science, astronomy, computational biology, telecommunications, semantic web, protein networks, and
many more.
Queries can address this graph structure directly and explicitly. Associated with graphs are specific graph
operations in the query language algebra, such as finding shortest paths, determining certain subgraphs,
and so forth. [2]
Graph data management is about managing, querying, and analyzing a set of entities (nodes) and
interconnections (edges) between them, both of which may have attributes associated with them.
For example, it enables identifying influential persons in a social network, inspecting fraud operations in a
complex interaction network and recognizing product affinities by analyzing community buying patterns.
Nowadays, graphs with millions or even billions of nodes and edges have become very common.
For example, Facebook's graph contains over 1.3 billion users; every 20 minutes, 1 million links are
shared, 2 million friend requests are generated, and 3 million messages are sent. [3]
Similarly, Twitter has over 0.6 billion users; about 9,100 tweets are posted every second, and
people query the Twitter search engine 2.1 billion times every day. [4]
From these statistics we can state that graph data have reached a scale of hundreds of millions of entities,
that graph data are updated all the time, that graph data suffer from uncertainty due to both external
reasons (e.g. missing data) and internal reasons (dynamic changes in the graph data) and, even worse,
that graph data are much more complex than relational data. [5]
In summary, graph data have four key features: big, dynamic, uncertain and complex. [6]
The enormous growth in graph sizes requires huge amounts of computational power for analysis. In practice,
graphs with high-degree vertices are computationally challenging and contribute heavily to communication
and storage overhead. [2]
The second feature requires that graph search take dynamic changes and temporal factors into
consideration. The last two features require reasonable models that capture the
uncertainties in graph data, and highly efficient algorithms to answer graph search queries on uncertain
graphs. Together, these make it an extremely challenging task to manage and analyze graphs efficiently.

Do we need a supercomputer or similar? In what cases?

As the scale of the data grows, traditional single-processor systems, and even single-machine parallel
systems, become inadequate, and it is imperative to adopt a distributed approach to data handling and processing.
Many common graph algorithms, such as traversals or PageRank, are easily parallelizable. For example, a BFS
query that starts at a node in the graph with three child nodes can explore the three subtrees in parallel,
and the available parallelism can grow exponentially the deeper the query explores.
This means that the number of cores of a single parallel system soon becomes a bottleneck for this kind of
graph computation, while a large-scale distributed system can offer a solution to this problem. Also,
a single host can only serve a fixed number of I/O operations per second and, again, a distributed system
improves performance in this respect. [7]
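The frontier growth described above for BFS can be illustrated with a minimal sketch. It assumes a complete
tree with branching factor 3 as input (the helper complete_tree and the function names are hypothetical, chosen
only for this illustration) and runs sequentially; the inner loop over the frontier is the part that a parallel
or distributed system could split among workers.

```python
from collections import defaultdict


def bfs_levels(adj, source):
    """Return the list of BFS frontiers (one list of vertices per level)."""
    visited = {source}
    frontier = [source]
    levels = []
    while frontier:
        levels.append(frontier)
        next_frontier = []
        # Each vertex in the frontier is independent work that could be
        # handed to a different core or machine.
        for u in frontier:
            for v in adj.get(u, []):
                if v not in visited:
                    visited.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return levels


def complete_tree(branching, depth):
    """Build a complete tree as an adjacency dict (illustrative input only)."""
    adj = defaultdict(list)
    next_id = 1
    level = [0]
    for _ in range(depth):
        new_level = []
        for parent in level:
            for _ in range(branching):
                adj[parent].append(next_id)
                new_level.append(next_id)
                next_id += 1
        level = new_level
    return dict(adj)


tree = complete_tree(branching=3, depth=4)
frontier_sizes = [len(f) for f in bfs_levels(tree, 0)]
print(frontier_sizes)  # [1, 3, 9, 27, 81]
```

On real graphs the frontier does not grow this regularly, but the sketch shows why the amount of independent
work can quickly exceed the number of cores available on one machine.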
Another reason to use a distributed system is that these big graphs cannot fit in a single machine's main
memory, and for this reason graphs are usually sharded over multiple machines.
Other reasons to use a distributed system are price and expandability.
It is more cost-effective to build a large distributed system and, in case of need, the system can
scale out quickly just by spinning up more nodes.

Which type of Software Stack? How are nodes and links stored?

A graph $G = (V, E)$ can be of two types based on the number of edges it has. $G$ is sparse if $|E|$ is much
smaller than $O(|V|^2)$; otherwise it is dense. The adjacency matrix representation is used for dense graphs,
while the adjacency list representation is used for sparse graphs. [8]
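The difference between the two representations can be sketched as follows. This is only an illustration on a
small undirected path graph with 6 vertices (the function names and the example graph are assumptions, not taken
from the cited sources): the matrix always needs |V|^2 cells, while the list needs one entry per edge endpoint.

```python
def density(num_vertices, edges):
    """Fraction of all possible undirected edges that are present."""
    max_edges = num_vertices * (num_vertices - 1) // 2
    return len(edges) / max_edges if max_edges else 0.0


def to_adjacency_matrix(num_vertices, edges):
    """Dense representation: one cell per ordered pair of vertices."""
    matrix = [[0] * num_vertices for _ in range(num_vertices)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


def to_adjacency_list(num_vertices, edges):
    """Sparse representation: one neighbor list per vertex."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


n = 6
path_edges = [(i, i + 1) for i in range(n - 1)]  # 5 edges: a sparse graph

matrix = to_adjacency_matrix(n, path_edges)
adj_list = to_adjacency_list(n, path_edges)

matrix_cells = n * n                                 # storage grows with |V|^2
list_entries = sum(len(nbrs) for nbrs in adj_list)   # storage grows with 2|E|
print(matrix_cells, list_entries, round(density(n, path_edges), 2))
```

For this path graph the matrix stores 36 cells while the list stores 10 entries, and the gap widens as |V|
grows while |E| stays close to |V|.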
In the case of dense graphs, the adjacency matrix is partitioned in such a way that each processor does
an equal amount of work and communication is localized. This leads to proper load balancing and less
communication overhead.
For sparse graphs, however, the adjacency list representation is essential to reduce the computational
overhead by taking advantage of the sparseness. But then it is very difficult to achieve an even work
distribution.
In-memory systems, whether distributed or shared-memory, store and process the entire graph in RAM.
While this can provide good performance, the size of the graph that can be processed is limited by the
amount of RAM available in the system. Disk-based systems, on the other hand, are not limited by RAM
availability, but sacrifice performance due to disk I/O operations.
Sharding is the process of splitting up the data; it is helpful when a specific data set
outgrows either the storage or the reasonable performance of a single system. [2, 9]
Distributed graph processing substantially depends on a suitable data allocation (partitioning) of the graph
data among all nodes of the processing system. This data allocation should enable graph processing with a
minimum of inter-node communication and data transfer, while at the same time ensuring good load
balancing so that all nodes can be effectively utilized. The associated optimization objective is to find a
balanced distribution of vertices and their edges such that each partition includes about the same
number of vertices while the number of edges crossing partitions is minimized.
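The two quantities in this objective, partition balance and the number of edges crossing partitions (the edge
cut), can be measured with a small sketch. The example graph (two 4-cliques joined by one bridge edge) and the
two assignments are assumptions made for illustration; the sketch only evaluates given partitions and does not
implement a partitioning algorithm.

```python
def hash_partition(vertices, num_parts):
    """Assign each vertex to a partition by vertex id modulo num_parts."""
    return {v: v % num_parts for v in vertices}


def edge_cut(edges, assignment):
    """Count the edges whose endpoints are placed in different partitions."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def partition_sizes(assignment, num_parts):
    """Number of vertices placed in each partition."""
    sizes = [0] * num_parts
    for part in assignment.values():
        sizes[part] += 1
    return sizes


# Two 4-cliques (vertices 0-3 and 4-7) connected by the bridge edge (3, 4).
clique_a = [(u, v) for u in range(4) for v in range(u + 1, 4)]
clique_b = [(u, v) for u in range(4, 8) for v in range(u + 1, 8)]
edges = clique_a + clique_b + [(3, 4)]
vertices = range(8)

by_hash = hash_partition(vertices, 2)
by_community = {v: 0 if v < 4 else 1 for v in vertices}

print(partition_sizes(by_hash, 2), edge_cut(edges, by_hash))            # [4, 4] 9
print(partition_sizes(by_community, 2), edge_cut(edges, by_community))  # [4, 4] 1
```

Both assignments are perfectly balanced, but the hash-based one cuts 9 of the 13 edges while the
structure-aware one cuts only the bridge, which is why balance alone is not a sufficient criterion.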

Is there a common/typical problem for graph algorithms?

A problem graph algorithms may face is that, because graphs represent the relationships between entities
and because these relationships may be irregular and unstructured, the computations and data access
patterns tend not to have very much locality. Performance in contemporary processors is predicated upon
exploiting locality. Thus, high performance can be hard to obtain for graph algorithms. [2]
Graph computations are often completely data-driven. The computations performed by a graph algorithm
are dictated by the vertex and edge structure of the graph on which it is operating rather than being directly
expressed in code. As a result, parallelism based on partitioning of computation can be difficult to express
because the structure of the computations of the algorithm is not known a priori. [2]
In many common graph algorithms the amount of work can grow exponentially at each step (for example, the
number of vertices reached by a traversal), so with big graphs performance can drop significantly.
In addition, big graphs are difficult to partition: if we use a distributed system to
take advantage of the possible parallelization, parts of the graph end up on every machine, so
neighboring vertices and even edges can be on different machines. For these reasons, queries need to
be executed in a distributed way and their partial results need to be merged locally. Distributed algorithms
must handle these problems. [10]
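The need to fan a query out to several machines and merge the partial results can be sketched with a toy,
in-process stand-in for a sharded graph. Everything here is an assumption made for illustration (the class
ShardedGraph, the modulo placement rule, and counting one "request" per shard touched); it does not model any
particular graph database.

```python
class ShardedGraph:
    """Toy stand-in for a graph whose adjacency lists are spread over shards."""

    def __init__(self, edges, num_shards):
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]
        for u, v in edges:
            self._add(u, v)
            self._add(v, u)

    def shard_of(self, vertex):
        # Hypothetical placement rule: vertex id modulo number of shards.
        return vertex % self.num_shards

    def _add(self, u, v):
        self.shards[self.shard_of(u)].setdefault(u, set()).add(v)

    def neighbors_batch(self, vertices):
        """Query each involved shard once and merge the partial answers."""
        by_shard = {}
        for v in vertices:
            by_shard.setdefault(self.shard_of(v), []).append(v)
        merged = set()
        for shard_id, batch in by_shard.items():  # one simulated request per shard
            shard = self.shards[shard_id]
            for v in batch:
                merged |= shard.get(v, set())
        return merged, len(by_shard)

    def two_hop(self, source):
        """Vertices within two hops of source, plus the simulated request count."""
        first, requests_1 = self.neighbors_batch([source])
        second, requests_2 = self.neighbors_batch(first)
        return (first | second) - {source}, requests_1 + requests_2


edges = [(0, 1), (0, 2), (1, 3), (2, 4), (4, 5)]
graph = ShardedGraph(edges, num_shards=3)
reachable, requests = graph.two_hop(0)
print(sorted(reachable), requests)  # [1, 2, 3, 4] 3
```

Even this two-hop query on a six-vertex graph touches all three shards, because the neighbors of vertex 0 live
on different shards than vertex 0 itself.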

[1] Renzo Angles and Claudio Gutierrez, 2017. An Introduction to Graph Data Management.

[2] Dhananjay Kumar Singh and Ripon Patgiri, 2014. Big Graph: Tools, Techniques, Issues, Challenges and
Future Directions.

[3] Sherif Sakr, 2016. Big Data 2.0 Processing Systems: A Survey.

[4] https://www.statisticbrain.com/twitter-statistics/

[5] Shuai Ma, Jia Li, Chunming Hu, Xuelian Lin, Jinpeng Huai, 2013. Big Graph Search: Challenges and
Techniques.

[6] Shuai Ma, Jia Li, 2012. Graph Search: A New Searching Approach to the Social Computing Era.

[7] http://hackingdistributed.com/2014/12/19/to-shard-or-not-to-shard-a-graph/

[8] Aditi Majumder. Parallel and Distributed Graph Algorithms and their Applications.

[9] Harshvardhan, Brandon West, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger (Parasol Laboratory).
A Hybrid Approach to Processing Big Data Graphs on Memory-Restricted Systems.

[10] https://www.voxxed.com/2017/03/handling-billions-of-edges-graph-database/
