
Migration of data from relational database to graph database

Conference Paper · March 2018


DOI: 10.1145/3200842.3200852


2 authors:

Yelda Ünal Halit Oğuztüzün


Middle East Technical University Middle East Technical University

All content following this page was uploaded by Yelda Ünal on 20 November 2018.


Migration of Data from Relational Database to Graph Database

Yelda Unal
TUBITAK BILGEM Software Technologies Research Institute
Ankara, Turkey
yelda.unal@tubitak.gov.tr

Halit Oguztuzun
Department of Computer Engineering, Middle East Technical University
Ankara, Turkey
oguztuzun@ceng.metu.edu.tr

ABSTRACT
Relational databases have been widely used in many applications and have met
the needs of data-intensive and transactional domains, but today data is
growing faster than ever and extracting information from this huge volume of
data is becoming more challenging. The growing size of data and the growing
number of connections between data items reduce performance because relational
databases use many complex join operations to query and access data. As a
solution, graph databases store these connections between entities, so that
connections can be traversed quickly and easily and data can be accessed
efficiently. This article reports on our experience of migrating
document-based, parent-child hierarchical data from a relational database to a
graph database. It also reports a comparison of data access processes and
performance between the relational database and the graph database.

CCS CONCEPTS
• Information systems → Data management systems → Database design and models →
Graph-based database models → Hierarchical data models → Migration from
relational database to graph database, Comparison of data access.

KEYWORDS
Relational Database, Graph Database, Migration, NoSQL.

ACM Reference format:
Y. Unal and H. Oguztuzun. 2018. Migration of Data from Relational Database to
Graph Database. In ICIST ’18: 8th International Conference on Information
Systems and Technologies, March 16–18, 2018, Istanbul, Turkey. ACM, New York,
NY, USA, 5 pages. https://doi.org/10.1145/3200842.3200852

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit
is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from Permissions@acm.org.
ICIST ’18, March 16–18, 2018, Istanbul, Turkey
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6404-1/18/03...$15.00
https://doi.org/10.1145/3200842.3200852

1 INTRODUCTION
Relational database management systems were first released in the early 1970s.
The relational model has been the most popular and common database model for
both commercial and non-commercial applications since it was created. Today,
there are many commercial relational database management systems, such as
Oracle, IBM DB2 and Microsoft SQL Server, and there are also free and open
source RDBMSs, such as MySQL, PostgreSQL and SQLite. The relational database
model is a set of tables, rows and columns used to organize and store data. A
table can be shown as a matrix of rows and columns, where each intersection of
a row and a column contains a specific data value. It is relational because all
rows in a table share the same fields and relationships can be created among
the tables to store and retrieve selected data efficiently. The standard way to
access data in a relational database is an SQL (Structured Query Language)
query. SQL queries can be used to create, read, update and delete data in
existing tables.

Today, with the spread of the Internet and the digitization of data, data is
growing faster than ever and extracting information from this huge volume of
data is becoming more challenging. The growing size of data and the growing
number of connections between data items reduce performance in traditional
database management systems because these databases use many complex join
operations to access and retrieve data. With the growing size of data,
relational database models started failing to meet the requirements of
application domains that are data intensive and have highly connected data.
Therefore, researchers started to investigate storage alternatives to
relational databases. NoSQL is a common term for those alternative systems and
first began to gain popularity in 2009. BigTable, Cassandra, CouchDB and Dynamo
are all NoSQL projects, as they are huge-volume and highly connected data
stores that eschew relational and object-relational models. Early adopters of
graph technology reimagined their businesses around the value of data
relationships. These companies have now become industry leaders: LinkedIn,
Google, Facebook and PayPal.

The graph database model is a set of nodes, edges and properties used to
represent and store data; based on the NoSQL approach, it uses the graph
structure for semantic queries to retrieve data. Graph databases are gaining a
lot of interest, as they use powerful data modeling tools that provide a closer
fit to real world data. Graph
databases store, process and query connections between data
efficiently by storing relationships as edges in the data model. While
relational databases compute relationships at query time through expensive and
complex join operations, graph databases store and process connections as data
entities.

A graph database stores connections as persistent entities. Accessing
connections is efficient and allows traversing data easily without any
high-cost join operations. The property graph contains connected entities as
nodes, which can hold any number of attributes (key-value pairs). Nodes can be
tagged with labels representing their different roles in the domain.
Relationships provide directed, named, semantically relevant connections
between two node entities. A relationship always has a direction, a type, a
start node, and an end node. Like nodes, relationships can have properties.

Figure 1: Property graph model

Real world domains have many entities, with their properties and complex
relationships between entities. A relational database stores an entity and its
properties in a structured table. Relationships are stored by derived foreign
key references. Accessing an entity through other entities requires many join
operations.

This study aims at migrating document-based, parent-child hierarchical data
from a relational database to a graph database and at comparing data access
processes and performance between the two.

This paper is a comparison of the relative usefulness of the relational
database MySQL and the graph database Neo4j to store graph data.

2 BACKGROUND
MySQL is one of the most popular open source relational databases, enabling
reliable and scalable relational database applications. In this study, MySQL
Community Server is used as the relational database management system for
storing application data.

Neo4J is one of the most popular graph database management systems and also one
of the most popular NoSQL database systems. Neo4J stores and presents data in
the form of a graph. Data is represented by nodes and relationships between
those nodes. Neo4J is very suitable for storing and retrieving data that has
many interconnecting relationships.

3 DATA MODEL FOR THE LEGAL DOCUMENTS SYSTEM
Legal document systems have many different document types and complex relations
between these documents and parts of documents. In this study, the legal
document system has eighteen data entity types and a three-level hierarchy for
each data type. Law is one of the legal document types in the system and it has
numbered clauses as child entities. Numbered clauses of a law document have
paragraphs as child entities and also numbered clauses as sub-clauses.
Paragraphs of numbered clauses have sub-paragraphs as children in the law
document.

Figure 2: Data structure in legal document system

Figure-2 shows the structure of the legal document system data model. Besides
parent-child relationships, there are also cross relations between different
types of legal documents and their parts. A numbered clause can be related to
another numbered clause or to a paragraph of another law document. Every clause
or paragraph can be stored as an item in the model and can be related to any
other item.
The legal document system data model was designed as a relational model. This
relational model was implemented in the MySQL database management system.
Numbered clauses form a tree, and traversing these clauses within the same
relational table decreased performance. Figure-3 shows the part of the design
for the law data type in the legal document system. Self-referencing tables and
the tree hierarchy caused performance issues in the application while
traversing and navigating through data and relations, because each table holds
a huge amount of data.

Figure 3: The relational model of the domain

As a solution, the same legal document data model was designed as a graph
model, as shown in Figure-4. Relationships between parents and children and
between other entities are frequently queried in this system; therefore, using
a graph model has increased performance. Navigation through the data can be
provided easily by searching for a label and relationship pattern and
traversing edges between nodes in the graph model.

As shown in Figure-4, foreign key references are transformed to edges between
nodes, and each node has an incoming link from its parent node or from the
related part of a document. Each table is transformed to a node label and each
data item in the table is transformed to a node instance in the graph model.

Figure 4: The graph model of the domain (Kanun=Law, Mevzuat=Legislation,
Icerik=Contents)

4 TRANSFORMATION RULES
Data migration was implemented in two main steps. First, metadata and table
data were extracted from the relational database management system MySQL by
using Schema Crawler and the Java SQL library. Second, data was imported to the
graph database management system by using the Neo4J API.

Extracting data and metadata from the relational database management system was
implemented with the following steps:

1. A JDBC connection is used to access the relational database.
2. Schema Crawler is used to extract the metadata from the relational
database.
3. Table data (table names, primary keys, foreign keys, fields) is extracted.
4. The Java SQL library is used to get ResultSet data from each table.
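The extraction steps above can be sketched as follows. This is not the authors' tool, which used JDBC, Schema Crawler and MySQL from Java. As a stand-in that runs without a database server, the sketch uses Python's built-in sqlite3 module and SQLite's PRAGMA table_info and PRAGMA foreign_key_list to read the same kinds of metadata; the law and clause tables and their columns are assumptions made for this example.

```python
import sqlite3


def extract_metadata(conn):
    """Return fields, primary keys and foreign keys for every user table."""
    table_names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    metadata = {}
    for table in table_names:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        metadata[table] = {
            "fields": [col[1] for col in columns],
            "primary_keys": [col[1] for col in columns if col[5] > 0],
            # (referencing column, referenced table, referenced column)
            "foreign_keys": sorted((fk[3], fk[2], fk[4]) for fk in fks),
        }
    return metadata


# Hypothetical schema: a law table and a self-referencing clause table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE law (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE clause (
        id INTEGER PRIMARY KEY,
        law_id INTEGER REFERENCES law(id),
        parent_id INTEGER REFERENCES clause(id),
        content TEXT
    );
    INSERT INTO law VALUES (1, 'Law A');
    INSERT INTO clause VALUES (10, 1, NULL, 'Clause 1');
    INSERT INTO clause VALUES (11, 1, 10, 'Sub-clause 1.a');
""")

metadata = extract_metadata(conn)
# Step 4 equivalent: fetch the row data of one table.
rows = conn.execute("SELECT * FROM clause ORDER BY id").fetchall()
print(metadata["clause"])
print(rows)
```

The foreign key on parent_id that points back to clause is what marks the table as self-referencing, which is the information the later transformation needs in order to create parent-child edges.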

Migration of data to the graph database management system was implemented with
the following steps:

1. The ResultSet data of each table is transformed into nodes and
relationships according to the transformation rules.
2. Nodes and relationships are stored in two iterables, namely InputNodes and
InputRelationships.
3. The Neo4J Parallel Batch Importer API is used to import the data; the two
generated input iterables are given as parameters to the batch importer.
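The first two steps can be illustrated with a short sketch. It does not use the Neo4J batch importer API: plain Python lists stand in for the InputNodes and InputRelationships iterables. The metadata layout, the relationship type names (HAS_CLAUSE, HAS_SUB_CLAUSE) and the sample rows are assumptions made for this example.

```python
def to_graph(metadata, rows_by_table):
    """Turn table rows into node records and relationship records."""
    input_nodes, input_relationships = [], []
    for table, rows in rows_by_table.items():
        meta = metadata[table]
        pk = meta["primary_key"]
        fk_columns = {fk["column"] for fk in meta["foreign_keys"]}
        for row in rows:
            # Each row becomes a node labelled with its table name. The key
            # is kept only as the node id; foreign key columns are dropped
            # from the properties because they become edges.
            input_nodes.append({
                "id": (table, row[pk]),
                "label": table,
                "properties": {k: v for k, v in row.items()
                               if k != pk and k not in fk_columns},
            })
            # Each non-null foreign key becomes an edge from the referenced
            # (parent) node to this node.
            for fk in meta["foreign_keys"]:
                if row[fk["column"]] is not None:
                    input_relationships.append({
                        "type": fk["rel_type"],
                        "start": (fk["ref_table"], row[fk["column"]]),
                        "end": (table, row[pk]),
                    })
    return input_nodes, input_relationships


# Hypothetical metadata and rows for a law table and a clause table.
metadata = {
    "Law": {"primary_key": "id", "foreign_keys": []},
    "Clause": {"primary_key": "id", "foreign_keys": [
        {"column": "law_id", "ref_table": "Law", "rel_type": "HAS_CLAUSE"},
        {"column": "parent_id", "ref_table": "Clause",
         "rel_type": "HAS_SUB_CLAUSE"},
    ]},
}
rows_by_table = {
    "Law": [{"id": 1, "name": "Law A"}],
    "Clause": [
        {"id": 10, "law_id": 1, "parent_id": None, "content": "Clause 1"},
        {"id": 11, "law_id": 1, "parent_id": 10, "content": "Sub-clause 1.a"},
    ],
}

input_nodes, input_relationships = to_graph(metadata, rows_by_table)
print(len(input_nodes), len(input_relationships))
```

Edges are directed from the parent to the child so that every node has an incoming link from its parent, matching the graph model described in Section 3.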


During data migration of the legal document system, conversion was implemented
according to predefined transformation rules. A migration tool was implemented
as a Java application that uses the transformation rules below, and data was
migrated by executing this application. Metadata was used to create node labels
and relationship types in the graph database, and data was used to create node
instances and edges between nodes.

1. Each entity table is represented by a label on nodes.
2. Each row in an entity table is a node.
3. Columns on those tables become node properties.
4. Remove technical primary keys, keep domain primary keys.
5. Add unique constraints for business primary keys, add indexes for frequent
lookup attributes.
6. Replace foreign keys with relationships to the other table, remove them
afterwards.
7. Remove data with default values; there is no need to store those.
8. Data in tables that is denormalized and duplicated might have to be pulled
out into separate nodes to get a cleaner model.
9. Join tables are transformed into relationships; columns on those tables
become relationship properties.

Figure 5: Transformation of relational model to graph model
(Kanun = Law, Hukum = Judgment, Vergi = Tax)

5 DATA QUERY AND PERFORMANCE
Queries for the same data access were developed in both the relational and the
graph database models for comparison. After the legal document system data was
migrated to the graph database, data access performance was compared by
searching for the same data value and relationship pattern. Figure-6 shows the
SQL query developed in the MySQL database to retrieve all laws whose parent
name is "Vergi Mevzuat Seti". As shown in Figure-6, two join operations are
necessary to access the child data; these tables hold a huge amount of data,
and the join operations on them decreased performance.

Figure 6: The example of SQL query

In Figure-7, the same data access operation was implemented in the Neo4J
database to retrieve all laws whose parent name is "Vergi Mevzuat Seti". This
Cypher query is more readable, and the query result was retrieved 10 times
faster than in the relational database. As seen in Figure-7, there was no need
for complex join operations or for traversing all table data of the given
entity type during the search. During query execution, the parent node whose
name is "Vergi Mevzuat Seti" was accessed first. After that, all child nodes
were accessed through only the related edges.

Figure 7: The example of Cypher query

Query development is more straightforward because a graph database stores data
in a model that imitates real world objects and the relationships between them.
Cypher query notation is easy to understand for many complex data access
operations compared to relational database SQL queries. Cypher query
development is more efficient than SQL query development for tree-hierarchy
data models.

Query execution in a graph database is faster than query execution in a
relational database for tree-hierarchy data models such as the legal document
system data used in this study. Data access in the graph database is six times
faster than in the relational database for one thousand data records and thirty
times faster for ten thousand data records. Searching for a given data value
among two thousand law data items took 0.01 second in the graph database, which
is ten times faster than in the relational database.

6 CONCLUSIONS
When deciding which database model is most suitable for a specific domain, the
data should be investigated by considering some basic criteria. If the data has
many many-to-many relationships, using the graph database model can be very
efficient. A graph database can traverse data very efficiently by using
relationship entities, while a relational database has to use many complex and
expensive join operations.
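The contrast for many-to-many data can be sketched on a toy example. This is only an illustration of the two access patterns, not a benchmark and not the queries from the study: SQLite (through Python's sqlite3 module) stands in for the relational side, an in-memory adjacency dictionary stands in for stored relationships, and the item and item_relation tables with their sample rows are assumptions made for this example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema: items related many-to-many through a join table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE item_relation (
        from_id INTEGER REFERENCES item(id),
        to_id INTEGER REFERENCES item(id)
    );
    INSERT INTO item VALUES
        (1, 'Clause 1'), (2, 'Clause 2'), (3, 'Paragraph 7');
    INSERT INTO item_relation VALUES (1, 2), (1, 3), (2, 3);
""")

# Relational access: two joins through the join table at query time.
sql_result = [row[0] for row in conn.execute("""
    SELECT target.title
    FROM item AS source
    JOIN item_relation AS rel ON rel.from_id = source.id
    JOIN item AS target ON target.id = rel.to_id
    WHERE source.title = ?
    ORDER BY target.id
""", ("Clause 1",))]

# Graph-style access: relationships are materialized once as adjacency
# lists, then followed directly from the start node.
titles = dict(conn.execute("SELECT id, title FROM item"))
related = defaultdict(list)
for from_id, to_id in conn.execute(
        "SELECT from_id, to_id FROM item_relation ORDER BY to_id"):
    related[from_id].append(to_id)

start = next(i for i, title in titles.items() if title == "Clause 1")
graph_result = [titles[target] for target in related[start]]

print(sql_result)
print(graph_result)
```

Both lookups return the same items; the difference is that the second one touches only the edges of the start node instead of matching rows across three table scans or index lookups.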


Another criterion is that a relational database requires a predefined schema
before any data is added to the system, while a NoSQL database allows data to
be added without any predefined schema. Therefore, for a system that requires
frequent schema changes, the graph database model should be used for data
storage and retrieval.
The other important consideration is having tables with many columns of which
only a few are used by each row. Data can have many very different attributes,
and only a few of them may be meaningful for a given data item. In contrast to
a relational database, a graph database stores only the meaningful attributes
of each node, and storing only the used attributes increases efficiency.
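As a small illustration of this point, the sketch below builds the property map of a node from a wide row by dropping empty columns and columns that only hold their default value, in the spirit of transformation rule 7. The function name, the column names and the default values are assumptions made for this example.

```python
def node_properties(row, defaults=None):
    """Keep only the attributes that carry information for this row."""
    defaults = defaults or {}
    return {
        column: value
        for column, value in row.items()
        # Drop NULLs and values that merely repeat the column default.
        if value is not None
        and not (column in defaults and defaults[column] == value)
    }


# Hypothetical wide row in which most columns are unused.
row = {
    "title": "Clause 1",
    "repeal_date": None,
    "footnote": None,
    "status": "active",
}
properties = node_properties(row, defaults={"status": "active"})
print(properties)
```

A relational table has to reserve every column for every row, whereas the node created from this row carries only the title property.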
In this study, the data has tree-like characteristics. Data items have
references to items of the same type, which requires self-referencing tables
and nested hierarchies in the relational model. Parent-child key values are
used to retrieve tree-hierarchy data, and this reduces performance when the
number of entities in the table is very large. Performance issues were
experienced while using the legal document system, which stores every small
part of a document as an atomic structure. After the data was migrated to the
graph database model, parent-child key values were transformed to relationships
between nodes of the same type. While self-referencing tables require join
operations on a huge table for parent-child relationships, a graph database
traverses a parent and its child nodes very efficiently using relationships.
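The two ways of walking such a tree can be sketched side by side. This is an illustration under assumptions, not the schema or the queries of the study: SQLite's WITH RECURSIVE (through Python's sqlite3 module) stands in for repeated self-joins on a self-referencing table, an adjacency dictionary stands in for stored parent-child relationships, and the item table with its sample rows is hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical self-referencing table: every row points to its parent row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES item(id),
        title TEXT
    );
    INSERT INTO item VALUES
        (1, NULL, 'Law A'),
        (2, 1, 'Clause 1'),
        (3, 2, 'Sub-clause 1.a'),
        (4, 2, 'Paragraph 1'),
        (5, 4, 'Sub-paragraph 1'),
        (6, NULL, 'Law B');
""")

# Relational access: the table is joined with itself once per tree level.
sql_descendants = [row[0] for row in conn.execute("""
    WITH RECURSIVE subtree(id) AS (
        SELECT id FROM item WHERE parent_id = ?
        UNION ALL
        SELECT item.id FROM item JOIN subtree ON item.parent_id = subtree.id
    )
    SELECT id FROM subtree ORDER BY id
""", (1,))]

# Graph-style access: parent-child keys become edges that are followed
# directly from each visited node.
children = defaultdict(list)
for item_id, parent_id in conn.execute(
        "SELECT id, parent_id FROM item WHERE parent_id IS NOT NULL"):
    children[parent_id].append(item_id)


def descendants(node_id):
    """Collect all nodes reachable from node_id over parent-child edges."""
    found = []
    stack = list(children[node_id])
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(children[current])
    return sorted(found)


print(sql_descendants)
print(descendants(1))
```

Both approaches return the same subtree; the traversal visits only the nodes under the starting law, while each recursive step of the query has to match rows of the whole table.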

