puters (nodes) in a cluster. Each node can run its process without waiting for the completion of the processes on other nodes. This reduces the time needed to complete a task. In addition to saving time, parallel processing can also save the resources used, as no large task is performed on a single computer.

There are two ways to implement parallel processing in the CDC method: the MapReduce programming paradigm with Apache Hadoop, and Spark SQL from Apache Spark. MapReduce is a programming paradigm that undertakes parallel computational processes using two main functions: map and reduce. In essence, the principle of MapReduce is the same as that of parallel computational processes in general. It begins by dividing the input data into several parts, processing each part, and finally combining the results of these processes into one final result (Coulouris et al., 2012).

Fig. 4. MapReduce framework (Coulouris et al., 2012).
Figure 4 illustrates the MapReduce process, with the map function taking data in the form of a group of key-value pairs as input to be processed, and the reduce function receiving the map results as input and processing them to obtain the output of the process. The CDC method can be implemented using MapReduce by adopting the divide-and-conquer principle, similar to that conducted in the study by Bala et al. (2016). The data are divided into several parts, each to be processed separately. Each processed part then enters the reduce phase, which detects the changes in the data.
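As a rough illustration of this divide-and-conquer idea, the sketch below expresses snapshot comparison as plain map and reduce functions in Python. The record layout, the key choice, and the change labels are assumptions made for the example; this is not the program used in the study.

    # Sketch of snapshot-difference CDC expressed as map and reduce functions.
    # The record layout (id, value) and the change labels are illustrative assumptions.
    from itertools import groupby
    from operator import itemgetter

    def map_phase(tagged_rows):
        """Emit (key, (tag, row)) pairs, where tag marks the snapshot of origin."""
        for tag, row in tagged_rows:
            yield row["id"], (tag, row)

    def reduce_phase(pairs):
        """Group by key and decide whether each record is new, deleted, or changed."""
        pairs = sorted(pairs, key=itemgetter(0))
        for key, group in groupby(pairs, key=itemgetter(0)):
            versions = {tag: row for tag, row in (v for _, v in group)}
            old, new = versions.get("old"), versions.get("new")
            if old is None:
                yield key, ("insert", new)
            elif new is None:
                yield key, ("delete", old)
            elif old != new:
                yield key, ("update", new)

    # Example: two tiny snapshots of the same table.
    old_snapshot = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]
    new_snapshot = [{"id": 1, "value": "a"}, {"id": 2, "value": "B"}, {"id": 3, "value": "c"}]
    tagged = [("old", r) for r in old_snapshot] + [("new", r) for r in new_snapshot]
    print(list(reduce_phase(map_phase(tagged))))   # detected changes only

In an actual Hadoop deployment, these two functions would be distributed over the nodes by the MapReduce framework rather than run in a single process.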
As an alternative to MapReduce, parallel processing in the CDC method can be implemented by employing Spark SQL. Spark SQL is a module of Apache Spark that integrates relational processing with the functions of the Apache Spark API (Armbrust et al., 2015). Spark SQL makes it possible to use queries in the same way as when processing data with a database. Spark SQL can be used to run the CDC method with commonly used operations such as JOIN, FILTER, and OUTER-JOIN, so that CDC processing with Spark SQL can be implemented more easily than with MapReduce.
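A minimal sketch of this approach is shown below, expressing the comparison of two snapshots as a single Spark SQL query. The table names, columns, and sample rows are illustrative assumptions, not the schema used in the study.

    # Minimal sketch of expressing CDC as a Spark SQL query over two snapshots.
    # Table names and columns are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cdc-sparksql").getOrCreate()

    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
         .createOrReplaceTempView("old_snapshot")
    spark.createDataFrame([(1, "a"), (2, "B"), (3, "c")], ["id", "value"]) \
         .createOrReplaceTempView("new_snapshot")

    changes = spark.sql("""
        SELECT o.id AS old_id, n.id AS new_id,
               o.value AS old_value, n.value AS new_value
        FROM old_snapshot o
        FULL OUTER JOIN new_snapshot n ON o.id = n.id
        WHERE o.id IS NULL            -- present only in the new snapshot
           OR n.id IS NULL            -- removed from the source
           OR o.value <> n.value      -- modified
    """)
    changes.show()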
The use of MapReduce and Spark SQL for parallel processing cannot be done without distributed storage. This is because each process on a node needs to be able to access the data being processed, so the data have to be available and accessible at each node. Distributed storage keeps the data by replicating them onto several nodes in a cluster, so that the data can be accessed from any node in the cluster. A commonly used platform for distributed storage is HDFS (Hadoop Distributed File System).
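The sketch below shows one way a snapshot could be kept in HDFS so that every Spark worker in the cluster can read it back. The namenode address, port, and paths are assumptions made for illustration.

    # Sketch of keeping snapshots in HDFS so every node in the cluster can read them.
    # The namenode address, port, and paths are assumptions for illustration.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-snapshot-io").getOrCreate()

    base = "hdfs://namenode-host:9000/cdc"          # hypothetical namenode URI

    # Persist the latest full extract as the new snapshot (replicated by HDFS).
    extract_df = spark.read.option("header", True).csv(f"{base}/extract/latest.csv")
    extract_df.write.mode("overwrite").parquet(f"{base}/snapshot/current")

    # Any worker can later read the snapshot back for comparison.
    snapshot_df = spark.read.parquet(f"{base}/snapshot/current")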
The CDC method using parallel processing can greatly reduce the data processing time needed to detect changes, but it requires a lot of configuration and preparation before it is ready to be implemented. In this study, this process was implemented using a distributed system infrastructure (HDFS and Apache Spark). The process was run using the Spark SQL library from Apache Spark and the MapReduce programming paradigm from Apache Hadoop.

Fig. 5. The previous ETL process (above) and proposed incremental ETL process using distributed system (below).
IV. DESIGN AND IMPLEMENTATION

The proposed ETL process uses an incremental extract method and processes only the changed data. As shown in Figure 5, the current ETL process uses full extraction from the databases and then performs transformation and loading on the whole extracted data. The whole ETL process is performed using Pentaho Data Integration (PDI). Meanwhile, the proposed ETL process extracts new and changed data using the MapReduce framework and Apache Sqoop. The transformation and loading processes are performed using PDI. This section elaborates our big data cluster environment and the implementation of CDC.
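To make the shape of this pipeline concrete, the sketch below chains the three stages from a small Python driver: a Sqoop import into HDFS, a Spark job for the CDC step, and a PDI transformation. Every path, host, credential, and command-line flag here is an assumption for illustration, not the actual configuration used in the study.

    # Hypothetical driver for the incremental ETL pipeline described above.
    # All paths, hosts, and flags are illustrative assumptions.
    import subprocess

    # 1) Extract the source table into HDFS with Sqoop (runs as MapReduce jobs).
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://db-host/scele",   # assumed source database
        "--table", "log",
        "--target-dir", "/cdc/extract/latest",
        "-m", "4",                                    # number of parallel map tasks
    ], check=True)

    # 2) Detect changed rows with the Spark snapshot-difference job.
    subprocess.run([
        "spark-submit", "--master", "spark://master-host:7077",
        "snapshot_difference.py",                     # hypothetical CDC script
    ], check=True)

    # 3) Transform and load only the detected changes with PDI (flag syntax may vary).
    subprocess.run(["./pan.sh", "-file=transform_changes.ktr"], check=True)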
A. Server Configuration

The distributed system implemented was peer-to-peer. Peer-to-peer is an architectural design in which each process has the same role, whereby nodes interact without differentiating between the client and the server or the computer where an activity is run (Coulouris et al., 2012). The main reason for using this design is to maximize resources, because with this architecture all nodes undertake the processing simultaneously. Figure 6 displays the hostname and IP address of each server.

Fig. 6. Configuration of distributed system.

B. Apache Hadoop Configuration

Hadoop is a framework that provides facilities for distributed storage and for processing large amounts of data using the MapReduce programming paradigm. One of the important characteristics of Hadoop is that data partitioning and computation are conducted on several hosts, and Hadoop can run the application in parallel (Shvachko et al., 2010). Hadoop consists of two main components, viz., the Hadoop Distributed File System (HDFS), to provide distributed storage, and MapReduce, to support parallel computational processes. In this study, the most actively used component was HDFS, because the parallel processes used Apache Spark and Sqoop.
HDFS consists of two parts, the namenode and the datanodes. The role of the namenode is to maintain the tree structure of the file system and its metadata (White, 2015). In addition, the namenode maintains the block locations of each file allocated on the datanodes. Unlike the namenode, a datanode is a block storage site that is managed by the namenode. The datanodes store data and report to the namenode on the blocks where the data are stored. Thus, the data stored on the datanodes can only be accessed through the namenode. The namenode is generally located on a separate server from the datanodes. However, in this study, the cluster approach used was peer-to-peer in order to maximize the available resources. Thus, all servers were configured as datanodes and one server also served as the namenode, as shown in Figure 6.
C. Apache Spark Configuration

Spark is a platform used for processing conducted in a cluster. When processing large amounts of data, speed becomes a priority; thus, Spark was designed for rapid data processing. One of Spark's features is its capacity for in-memory data processing (Karau et al., 2015). Aside from that, Spark applications can be implemented in several programming languages such as Java, Scala, Python, and R. In this study, the language and library used was Python version 3.0.
Spark's architecture is generally classified into three parts: the Spark core, the cluster manager, and the supporting components (MLlib, Spark SQL, Spark Streaming, and graph processing), as illustrated in Figure 7. The Spark core is the part of Spark with the basic functions, such as task scheduling and memory management.

Fig. 7. Spark Architecture (Apache Spark, 2013).
In Apache Spark there are two server roles that have to be configured: the master and the workers, as shown in Figure 6. The master manages the processes and allocates the resources to the worker servers, while the workers execute the processes. In this experiment, each server has a role as a worker, and only one server serves as the master as well as a worker. Because of that, all resources in this experiment are used and every server is at the same level.

A cluster manager is needed by Spark to maximize flexibility in cluster management. Spark can work with a third-party cluster manager such as YARN or Mesos. Spark also has its own cluster manager, the standalone scheduler. In this study, Spark was configured as a standalone cluster.
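Under this configuration, an application would attach to the standalone scheduler roughly as in the sketch below; the master hostname, port, and resource setting are assumptions made for illustration.

    # Sketch of attaching an application to the standalone cluster manager.
    # The master hostname and resource setting are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cdc-snapshot-difference")
             .master("spark://master-host:7077")     # standalone master URL (assumed hostname)
             .config("spark.executor.memory", "2g")  # example resource setting
             .getOrCreate())

    print(spark.sparkContext.master)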
D. Implementation

In this study, the method used was comparing data and finding the differences between two data sets. The CDC method using snapshot difference was divided into several stages, as illustrated in Figure 8. The first step was to take the data from the source. The data were taken by full load, which took the entire data from the beginning to the latest condition. The results of the extraction were placed into HDFS. The data were then processed by the program created to run the CDC process using the snapshot-difference technique. Snapshot difference was implemented using the outer-join function of Spark SQL. The program worked by looking at the null values in the outer-join results of the two records, which serve as the indicator of the newest data. If damage to the data source was detected during the process, the program could automatically change the data reference used for comparison and store the old data as a snapshot.
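A condensed sketch of this stage is given below: the stored snapshot and the newest full extract are joined, the rows whose outer-join result contains nulls or differing values are kept as changes, and the extract then replaces the snapshot. The key column, paths, and namenode URI are assumptions; the actual program of the study is not reproduced here.

    # End-to-end sketch of the snapshot-difference stage described above.
    # Paths, the key column, and the namenode URI are illustrative assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("snapshot-difference").getOrCreate()
    base = "hdfs://namenode-host:9000/cdc"            # hypothetical HDFS location

    snapshot = spark.read.parquet(f"{base}/snapshot/current").alias("old")
    extract  = spark.read.parquet(f"{base}/extract/latest").alias("new")

    joined = snapshot.join(extract, on="id", how="full_outer")

    # A null on either side of the outer join marks an inserted or deleted row;
    # differing values mark an update. Unchanged rows get no label and are dropped.
    changed = (joined
        .withColumn("change_type",
            F.when(F.col("old.value").isNull(), F.lit("insert"))
             .when(F.col("new.value").isNull(), F.lit("delete"))
             .when(F.col("old.value") != F.col("new.value"), F.lit("update")))
        .filter(F.col("change_type").isNotNull())
        .select("id",
                F.col("old.value").alias("old_value"),
                F.col("new.value").alias("new_value"),
                "change_type"))

    # Only the detected changes continue to the transformation and loading stages.
    changed.write.mode("overwrite").parquet(f"{base}/changes/latest")

    # The newest extract becomes the snapshot for the next comparison.
    extract.write.mode("overwrite").parquet(f"{base}/snapshot/next")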
V. EVALUATION AND ANALYSIS

The ETL model using the CDC method was tested in the same way as the existing ETL method. Testing was done using a script with data from the SCELE log database and the Apache Web Server log files, which consist of 1,430,000 rows of data. The script simulated additions to the database server by iteration as well as per day, just as in the testing of the existing ETL model.

A. Running Time of ETL Process

Based on the testing, the extraction, transformation, and load processes in the first and second experiments did not incur

[Figure: number of records in data_source vs. data_processed per trial number.]
[Figure: extract_time, transform_time, and load_time in seconds against the number of records and against the day number; TIME_WITH_EXISTING_MODEL vs. TIME_SNAPSHOT.]

Fig. 11. Comparison of the running time of existing ETL method and CDC-

B. Evaluation on the Number of Data Processed in the Loading Stage