
Implementation of Change Data Capture in ETL
Process for Data Warehouse Using HDFS and
Apache Spark
Denny, I Putu Medagia Atmaja, Ari Saptawijaya, and Siti Aminah
Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia
Email: denny@cs.ui.ac.id, i.putu33@ui.ac.id, saptawijaya@cs.ui.ac.id, aminah@cs.ui.ac.id

Abstract—This study aims to increase ETL process efficiency and reduce processing time by applying the Change Data Capture (CDC) method in a distributed system using the Hadoop Distributed File System (HDFS) and Apache Spark in the data warehouse of the Learning Analytics system of Universitas Indonesia. Usually, increases in the number of records in the data source result in an increase in ETL processing time for the data warehouse system. This condition occurs as a result of an inefficient ETL process using the full load method. Using the full load method, ETL has to process the same number of records as the number of records in the data sources. The proposed ETL model design with the application of the CDC method using HDFS and Apache Spark can reduce the amount of data in the ETL process. Consequently, the process becomes more efficient and the ETL processing time is reduced by approximately 53% on average.

Index Terms—change data capture, data warehouse, distributed system, big data, extract transform load

Fig. 1. ETL process in Data Warehouse for Learning Analytics at Universitas Indonesia.
I. INTRODUCTION

Learning analytics systems in Universitas Indonesia apply a data warehouse as a single repository to analyze learning activities in online environment systems. These systems employ the data warehouse to cluster learning activity patterns in learning management systems and to predict high-risk students. Data from various sources are processed before they are moved into the data warehouse, as shown in Figure 1. First, data from various sources such as academic systems (known as SIAK in Universitas Indonesia - UI), Moodle-based learning management systems (known as SCELE in UI), and authentication providers are integrated into a data warehouse system. At the moment, there are four instances of SCELE running in Universitas Indonesia. Then, the data warehouse integrates data from various sources with different data formats into a single view and uniform format. Lastly, data with proper formats are loaded into the data warehouse repository. These processes are commonly known as Extract, Transform, and Load (ETL). The ETL process involves extracting data from multiple, heterogeneous data sources, transforming and cleansing the data, and ultimately loading the data into a data warehouse (Jorg and DeBloch, 2008).

Since the size of the data in our systems continues to grow, the ETL process takes longer over time. This increased usage of computing resources is caused by the full load approach implemented in our ETL processes. The use of the full load method to transfer processed data into the data warehouse causes longer ETL processing times. This method requires ETL to process the same amount of data as held in the source. The process becomes inefficient because all of the data in the warehouse has to be rewritten every time the ETL process is run, even if the data had been previously processed. The same data thus repeatedly goes through the ETL process.

The Change Data Capture (CDC) method can be used to deal with the problems in this ETL process. The CDC method can replace the inefficient full load method, especially if the ETL process is run periodically. Nevertheless, a CDC-based ETL process can still suffer from long processing times because, with several CDC approaches, the amount of data processed is still as large as the data source.

One of the techniques that can be used to reduce processing time with the CDC method is applying distributed processing. The use of Apache Spark for parallel processing and the Hadoop Distributed File System (HDFS) for distributed storage can enhance the system's capacity to process large amounts of data. Apache Spark can process large amounts of data using a relational scheme that can be manipulated to achieve maximum performance. This differs from MapReduce, which requires manual and declarative optimization to achieve maximum performance (Armbrust et al., 2015). In this study, Apache Hadoop was used to facilitate distributed storage in order to be able to run parallel processing. MapReduce programming was used for simple processes, such as transferring data from the database into HDFS using Apache Sqoop, an add-on tool for Hadoop that transfers data from a data source into the Hadoop environment.

The main purpose of this research is to reduce ETL processing time by using CDC with a distributed approach. The CDC technique can reduce the amount of data that will be processed in ETL, since it filters only new and updated data for processing, making the ETL process more efficient. Meanwhile, Apache Spark and HDFS are used to run the CDC technique in a distributed system to increase its performance.

This paper is organized as follows. Sections II and III discuss ETL and CDC, respectively. Then, our design and implementation of distributed ETL are discussed in Section IV. The results of our experiments are discussed in Section V. The comparison between the existing model and the proposed model is discussed in Section VI.
II. ETL PROCESS IN DATA WAREHOUSE

The data from the operational system comprise different types and structures, so they cannot be used directly in the data warehouse. Thus, the data from the operational system need to be processed prior to entry into the data warehouse. This processing aims to integrate and clean the data and to change it into a predetermined structure. Processing of operational data prior to its use in the data warehouse is known as Extract, Transform, Load (ETL). Extraction is the initial process in ETL, where data from the source are transferred into the ETL environment for processing. Transformation is the task of changing the data structures into the predetermined format as well as improving data quality. Load is the term used to refer to the transfer of transformed data into the data warehouse or repository. Load is also known as delivering. According to Kimball and Caserta (2004), delivering is the process of transferring transformed data into a dimensional model accessed by the user. Following the structure of a data warehouse using the star schema, the load process can be classified into two processes: the load process for fact tables and the load process for dimension tables.
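As a rough illustration of this split, the minimal PySpark sketch below loads one dimension table and one fact table from already-transformed data; the paths, table names, and columns are assumptions for illustration, not the actual schema of the Learning Analytics warehouse.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("etl-load-sketch").getOrCreate()

    # Transformed data staged in HDFS (assumed path and columns).
    activity = spark.read.parquet("hdfs:///staging/activity")

    # Load a dimension table: one row per distinct course.
    dim_course = activity.select("course_id", "course_name").dropDuplicates(["course_id"])
    dim_course.write.mode("overwrite").parquet("hdfs:///dw/dim_course")

    # Load the fact table: one row per learning activity event, keyed by the dimensions.
    fact_activity = activity.select("course_id", "user_id", "event_time", "action")
    fact_activity.write.mode("append").parquet("hdfs:///dw/fact_activity")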
The ETL process, as a backend, needs to fulfill three important roles, viz., to deliver the data effectively to the data warehouse user, to add value to the data during the cleaning and conforming phase, and to protect and document data lineage (Kimball and Ross, 2013). These three important roles of ETL in the data warehouse system take most of the whole development time: Kimball and Ross (2013) stated that 70% of time and energy are spent only to transform data from the source into the data warehouse.
III. CHANGE DATA CAPTURE (CDC)

As previously mentioned, Change Data Capture (CDC) methods can be used to improve the ETL process. CDC has several definitions. One definition is that CDC is a method to integrate data based on identifying, capturing, and delivering only the changes made to the data in the operational system (Tank et al., 2010). Another definition is that CDC is a technique to monitor the operational data source that focuses on data changes (Kimball and Caserta, 2004). From these two definitions, CDC can be regarded as a method to determine and detect changes in the data that occurred during a transaction in the operational system.

In general, CDC can be used to support the ETL system. The goal is to reduce the amount of data processed in the ETL system. The ETL process can run more efficiently because it only processes data that have been changed. This also enables more frequent updates from operational databases to the data warehouse.

A. CDC Methods

In general, applications of CDC can be categorized into immediate data extraction and deferred data extraction (Ponniah, 2010). Immediate data extraction allows extraction in real time when a change occurs in the data source. Meanwhile, in the deferred data extraction approach, the extraction process is performed periodically (at specified intervals). Therefore, the data extracted are those that have changed since the last extraction.

There are three methods of CDC application for immediate data extraction. Figure 2 depicts the three methods, which use the transaction log, database triggers, and capture in the source application. The CDC method using the transaction log utilizes the log recorded by the RDBMS of a data source in the form of a database. This method works by utilizing every event of the Create, Update, Delete (CRUD) operations that is recorded by the RDBMS in a log file. Systems that use this method look for data that have been changed or added by reading the contents of this log file. The second method utilizes database triggers. A database trigger is a procedural function that an RDBMS generally provides to take action when CRUD operations have been performed on particular data. This function can be utilized to propagate updates if there are changes in the data. The third method is capture in the source application. This method relies on an application or system at the data source that has the capacity to apply CDC. It is quite effective at reducing the load in the ETL process, especially during the extraction phase. However, it is limited to applications that can be modified and is not applicable to proprietary applications.

Fig. 2. CDC methods for immediate data extraction (adapted from Ponniah (2010)).

CDC application for deferred data extraction is generally classified into two types of methods. Figure 3 shows the two methods: one looks at the data based on a timestamp, and the other compares files. A CDC application using the first method utilizes the time column that most source data have. In several cases, the time column can provide more detailed information, such as when the data were entered and changed. This information can be easily utilized to detect data changes. Unlike the first method, the second method uses a more flexible technique because it does not depend on the attributes of the data source. This technique compares data from previous extractions with the current data, examining each data attribute in order to detect changed data.

Fig. 3. CDC methods for deferred data extraction (adapted from Ponniah (2010)).
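As an illustration of the timestamp-based deferred extraction described above, the following PySpark sketch pulls only the rows modified since the previous run; the paths, the column name time_modified, and the stored watermark value are assumptions for illustration.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cdc-timestamp-sketch").getOrCreate()

    # Snapshot of the operational log table, staged in HDFS (assumed path).
    source = spark.read.parquet("hdfs:///staging/scele_log")

    # Watermark recorded at the end of the previous extraction (assumed value).
    last_extraction = "2017-06-01 00:00:00"

    # Keep only records entered or changed after the last extraction.
    changed = source.filter(F.col("time_modified") > F.lit(last_extraction))
    changed.write.mode("append").parquet("hdfs:///staging/changed_rows")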
B. CDC Limitation

The use of the CDC method is similar to the extraction process in that it needs to adjust to the characteristics of the data source. This condition leads to different implementations of CDC techniques even when the same method is used. Several of the CDC methods mentioned previously cannot be applied to all types of data. Thus, each method has limits that influence the CDC process.

One of the limitations in applying the CDC method is that it is too dependent on the system of the data source. Several CDC methods require configuration on the side of the operational system. Such methods include the use of timestamps, triggers, and transaction logs. CDC application using the timestamp method can only be used if the data has a time record for all the changes that occurred. If not, the data structure needs to be changed to add a timestamp column, which is one of the weaknesses of this method (Mekterovic and Brkic, 2015). As with the timestamp method, trigger-based CDC requires access to the data source system to create a function that can detect changes in the data. In principle, this method works by utilizing the trigger facility that many RDBMSs have. Aside from the drawback of having to create a function, this method is also limited in the choice of data, as the data have to be in the form of a database managed by an RDBMS. The other method utilizes the RDBMS transaction log. It uses the database log of the data source to record each change to the database. The drawback of this method is that the RDBMS has to be monitored while running its logging function for every transaction, in order to prevent the loss of transactions (Mekterovic and Brkic, 2015).

To overcome the limitations of these methods, changes can be detected by comparing data, an approach commonly referred to as the snapshot differential. Initially, this process accesses the data using a full extract from the beginning to the current condition. This requires large amounts of resources and influences the performance and running time of the CDC process. As the amount of data increases, more time and computation are required to conduct the CDC process.

Other approaches are the delta view and parallel processing. The delta view is applied to data sources that are in the form of a database. The principle of this approach is to create a view that comprises the keys of the records that will be inserted in the ETL process. The delta view stores the keys of the updated, deleted, and inserted records, which serve as information about the changes for the ETL process (Mekterovic and Brkic, 2015). Even though this approach still requires access to the data source, the delta view does not change the data structure, so it is easier to implement. On the other hand, parallel processing approaches can overcome the problem of processing large amounts of data with the CDC method. The principle of this approach is to reduce the load on the system by dividing the process across several resources for simultaneous processing. Thus, the CDC process can be performed in less time.
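A minimal sketch of the delta-view idea, expressed here with Spark SQL temporary views rather than a view inside the source database; the snapshot paths, key column id, and compared attributes are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-view-sketch").getOrCreate()

    # Previous and current snapshots of one source table (assumed paths).
    spark.read.parquet("hdfs:///snapshots/users_prev").createOrReplaceTempView("prev")
    spark.read.parquet("hdfs:///snapshots/users_curr").createOrReplaceTempView("curr")

    # Keys of inserted (I), deleted (D), and updated (U) records, i.e. the
    # information a delta view would hold for the ETL process.
    delta_keys = spark.sql("""
        SELECT curr.id AS id, 'I' AS op
          FROM curr LEFT JOIN prev ON curr.id = prev.id WHERE prev.id IS NULL
        UNION ALL
        SELECT prev.id, 'D'
          FROM prev LEFT JOIN curr ON prev.id = curr.id WHERE curr.id IS NULL
        UNION ALL
        SELECT curr.id, 'U'
          FROM curr JOIN prev ON curr.id = prev.id
         WHERE curr.name <> prev.name OR curr.email <> prev.email
    """)
    delta_keys.write.mode("overwrite").parquet("hdfs:///staging/delta_keys")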
C. CDC in Distributed Systems

A distributed system is a group of subsystem components connected in a network that can communicate with each other. In a distributed system, the hardware and software components communicate with each other and coordinate their processes through message passing (Coulouris et al., 2012). The main goal in creating a distributed system is to share resources in order to enhance system performance. In implementation, there are two actions that can be taken in a distributed system: using a distributed file system to increase storage capacity and using parallel processing to increase throughput.

Parallel processing can be used to increase the performance of the CDC process by dividing a large process into smaller processes that are conducted simultaneously on different computers (nodes) in a cluster. Each node can run its process without waiting for the completion of processes on other nodes. This reduces the time needed to complete a task. In addition to saving time, parallel processing can also save resources, as no large task is performed on a single computer.

There are two ways to implement parallel processing in the CDC method: the MapReduce programming paradigm with Apache Hadoop and Spark SQL from Apache Spark. MapReduce is a programming paradigm that undertakes parallel computation using two main functions: map and reduce. In essence, the principle of MapReduce is the same as that of parallel computation in general. It begins by dividing the input data into several parts, processing each part, and finally combining the results of these processes into one final result (Coulouris et al., 2012).

Figure 4 illustrates the MapReduce process: the map function takes data in the form of key-value pairs as input to be processed, and the reduce function receives the map results and processes them to obtain the final output. The CDC method can be implemented with MapReduce by adopting the divide-and-conquer principle, similar to the approach taken in the study by Bala et al. (2016). The data are divided into several parts, each to be processed separately. Then, each processed part enters the reduce phase, which detects changes in the data.

Fig. 4. MapReduce framework (Coulouris et al., 2012).
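The sketch below illustrates this divide-and-conquer change detection with Spark's RDD API standing in for Hadoop MapReduce: records are mapped to key-value pairs, grouped by key across the old and new snapshots, and reduced to the keys that changed. The file paths and the comma-separated record layout are assumptions for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cdc-mapreduce-sketch").getOrCreate()
    sc = spark.sparkContext

    # Map phase: key each record by its first field (assumed record layout).
    old_rdd = sc.textFile("hdfs:///snapshots/log_prev").map(lambda line: (line.split(",", 1)[0], line))
    new_rdd = sc.textFile("hdfs:///snapshots/log_curr").map(lambda line: (line.split(",", 1)[0], line))

    # Reduce phase: group both snapshots by key and classify each key.
    def classify(kv):
        key, (old_vals, new_vals) = kv
        old_vals, new_vals = list(old_vals), list(new_vals)
        if not old_vals:
            return (key, "inserted")
        if not new_vals:
            return (key, "deleted")
        return (key, "updated" if old_vals != new_vals else "unchanged")

    changes = (old_rdd.cogroup(new_rdd)
                      .map(classify)
                      .filter(lambda kv: kv[1] != "unchanged"))
    changes.saveAsTextFile("hdfs:///snapshots/changes")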
As an alternative to MapReduce, parallel processing for the CDC method can be implemented by employing Spark SQL. Spark SQL is a module of Apache Spark that integrates relational processing with the Apache Spark API (Armbrust et al., 2015). Spark SQL can use queries, as in data processing with a database. It can run the CDC method with commonly used operations such as JOIN, FILTER, and OUTER JOIN, so that CDC processing with Spark SQL is easier to implement than with MapReduce.

The use of MapReduce and Spark SQL for parallel processing cannot be done without distributed storage. This is because each process on a node needs to be able to access the data being processed, so the data have to be available and accessible at every node. Distributed storage keeps the data by replicating them to several nodes in a cluster, so that the data can be accessed from any node in the cluster. A commonly used platform for distributed storage is the Hadoop Distributed File System (HDFS).

A CDC method using parallel processing can greatly reduce the data processing time needed to detect changes, but it requires a lot of configuration and preparation before it is ready to be used. In this study, this process was implemented using a distributed system infrastructure (HDFS and Apache Spark). The process was run using the SparkSQL library from Apache Spark and the MapReduce programming paradigm from Apache Hadoop.
IV. DESIGN AND IMPLEMENTATION

The proposed ETL process uses an incremental extract method and only processes changed data. As shown in Figure 5, the current ETL process uses full extraction from the databases and then performs transformation and loading on the whole extracted data. The whole ETL process is performed using Pentaho Data Integration (PDI). Meanwhile, the proposed ETL process extracts new and changed data using the MapReduce framework and Apache Sqoop, and the transformation and loading processes are performed using PDI. This section elaborates our big data cluster environment and the implementation of CDC.

Fig. 5. The previous ETL process (above) and proposed incremental ETL process using distributed system (below).

A. Server Configuration

The distributed system implemented was peer-to-peer. Peer-to-peer is an architectural design where each process has the same role, whereby nodes interact without differentiating between the client and the server, or between the computers where an activity is run (Coulouris et al., 2012). The main reason for using this design is to maximize resources, because with this architecture all nodes undertake the process simultaneously. Figure 6 displays the hostname and IP address of each server.

Fig. 6. Configuration of distributed system.

B. Apache Hadoop Configuration

Hadoop is a framework that provides facilities for distributed storage and processing of large amounts of data using the MapReduce programming paradigm. One of the important characteristics of Hadoop is that data partitioning and computation are conducted on several hosts, and Hadoop can run the application in parallel (Shvachko et al., 2010). Hadoop consists of two main components, viz., the Hadoop Distributed File System (HDFS), which provides distributed storage, and MapReduce, which supports parallel computation. In this study, the most actively used component was HDFS, because the parallel processes used Apache Spark and Sqoop.

HDFS consists of two parts, the namenode and the datanodes. The role of the namenode is to maintain the tree structure of the file system and its metadata (White, 2015). In addition, the namenode maintains the block locations of each file allocated on the datanodes. Unlike the namenode, a datanode is a block storage site that is managed by the namenode. The datanodes store data and report to the namenode on the blocks where the data are stored. Thus, the data stored in the datanodes can only be accessed through the namenode. The namenode is generally located on a separate server from the datanodes. However, in this study, the cluster approach used was peer-to-peer in order to maximize the available resources. Thus, all servers were configured as datanodes and one server also as the namenode, as shown in Figure 6.
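As a small illustration of how extracted snapshots are made available to every node, the PySpark sketch below stages an extracted file in HDFS, whose block replication across the datanodes is handled by HDFS itself; the namenode address and paths are assumptions, not the actual values from Figure 6.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-staging-sketch").getOrCreate()

    # Read the raw extract and write it back as a replicated snapshot in HDFS.
    snapshot = spark.read.csv("hdfs://namenode:9000/staging/scele_log.csv", header=True)
    snapshot.write.mode("overwrite").parquet("hdfs://namenode:9000/snapshots/scele_log_curr")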

C. Apache Spark Configuration

Spark is a platform used for processing conducted in a cluster. When processing large amounts of data, speed becomes a priority; thus, Spark was designed for rapid data processing. One of Spark's features is its capacity for in-memory data processing (Karau et al., 2015). Aside from that, Spark can be used with several programming languages such as Java, Scala, Python, and R. In this study, the language and library used was Python version 3.0.

The Spark architecture is generally classified into three parts: the Spark core, the cluster manager, and supporting components (MLlib, Spark SQL, Spark Streaming, and graph processing), as illustrated in Figure 7. The Spark core is the part of Spark with basic functions, such as task scheduling and memory management.

Fig. 7. Spark architecture (Apache Spark, 2013).

In Apache Spark there are two server roles that have to be configured: master and worker, as shown in Figure 6. The master manages the process and allocates resources to the worker servers, while the workers execute the process. In this experiment, every server has a role as a worker and only one server serves as both a master and a worker. Because of that, all resources in this experiment are used and every server is on the same level.

A cluster manager is needed by Spark to maximize flexibility in cluster management. Spark can work with a third-party cluster manager such as YARN or Mesos, and it also has its own cluster manager, the standalone scheduler. In this study, Spark was configured as a standalone cluster.
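A minimal sketch of how a PySpark job would attach to such a standalone cluster; the master hostname, port, and resource settings are assumptions for illustration.

    from pyspark.sql import SparkSession

    # Connect to the standalone scheduler rather than YARN or Mesos.
    spark = (SparkSession.builder
             .appName("cdc-etl")
             .master("spark://spark-master:7077")      # assumed master host and port
             .config("spark.executor.memory", "2g")    # assumed resource setting
             .getOrCreate())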
D. Implementation

In this study, the method used was comparing data and finding the differences between two datasets. The CDC method using snapshot difference was divided into several stages, as illustrated in Figure 8. The first process was to take the data from the source. The data were taken by full load, which took the entire data from the beginning to the final condition. The results of the extraction were entered into HDFS. The data were then processed by the program created to run the CDC process using the snapshot difference technique. The snapshot difference was implemented using the outer-join function from SparkSQL. The program works by looking at the null values in the outer-join result of the two record sets, which serve as the indicator of the newest data. If damage to the data source was detected during the process, the program could automatically change the reference data used for comparison and store the old data as a snapshot.

Fig. 8. Our CDC method with snapshot difference approach.
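The sketch below shows one way the outer-join comparison described above could look in PySpark; it is an illustration under assumed key and column names (id, payload) and assumed snapshot paths, not the authors' exact program.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("snapshot-diff-sketch").getOrCreate()

    prev = spark.read.parquet("hdfs:///snapshots/log_prev").alias("prev")
    curr = spark.read.parquet("hdfs:///snapshots/log_curr").alias("curr")

    # Full outer join of the two snapshots on the record key.
    joined = curr.join(prev, F.col("curr.id") == F.col("prev.id"), "full_outer")

    # A null on the previous side means a new record; a null on the current
    # side means a deleted record; otherwise compare attributes for updates.
    new_rows = joined.filter(F.col("prev.id").isNull()).select("curr.*")
    deleted_rows = joined.filter(F.col("curr.id").isNull()).select("prev.*")
    updated_rows = (joined
                    .filter(F.col("prev.id").isNotNull() & F.col("curr.id").isNotNull())
                    .filter(F.col("curr.payload") != F.col("prev.payload"))
                    .select("curr.*"))

    # Only these rows move on to the transformation and load stages.
    new_rows.union(updated_rows).write.mode("overwrite").parquet("hdfs:///staging/changed")

Depending on the available columns, the attribute comparison could equally be a hash computed over all non-key columns.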
V. EVALUATION AND ANALYSIS

The ETL model using the CDC method was tested in the same way as the existing ETL method. Testing was done using a script with data from the SCELE log database and the Apache web server log files, which consist of 1,430,000 rows of data. The script simulated additions to the database server by iteration as well as by day, just as in the testing of the existing ETL model.

A. Running Time of ETL Process

Based on the testing, the extraction, transformation, and load processes in the first and second experiments did not incur a significant increase in running time. This is because less data were processed compared to the process without CDC. Figure 9 shows the graphs of the first and second trials. In the first trial, the graph shows relatively constant extraction, transformation, and load times due to a steady amount of data increase. This differs from the second trial, which incurred more fluctuations, albeit not to a significant extent. This is due to an irregular amount of data increase.

The test results from the CDC method using snapshot difference were similar to those of the first approach. Figure 10 shows that the extraction time also tends to increase along with the increase in the amount of data, while the processing time for transformation and load stays relatively constant.

Fig. 9. Running time of the ETL process with snapshot difference in the experiment with the source database increased by 5,000 records (above) and daily (below).

B. Evaluation of the Number of Data Processed in the Loading Stage

Unlike the testing of the existing ETL model, the results of testing the ETL model with the CDC method showed neither an increase nor a reduction in the amount of data processed, even though the number of records in the data source had increased. The graphs in Figures 9 and 10 display the comparisons between the number of records in the data source and the data processed during the transformation phase for the two CDC approaches in the first and second trials. The graphs demonstrate that in the two experiments, the increased data source did not influence the amount of data processed.

Fig. 10. The number of records extracted compared to the ones processed in the loading stage with snapshot differential.
VI. COMPARISON OF EXISTING ETL METHOD AND CDC-BASED ETL METHOD

Our experiment shows that the ETL process proposed in this study improves the running time significantly compared to the previous ETL process. This can be seen in the graph in Figure 11, which shows that the growth of the running time of the previous ETL process is much higher than that of the CDC-based ETL method. When the amount of data was considerably small, the running time of the previous ETL process was faster than that of the CDC-based ETL process. Nevertheless, once the amount of data from the source increased, the growth rate of the running time for the previous ETL process was much higher. When the data from the sources grew to 1,430,000 rows, the current ETL process took approximately 457 seconds, while the CDC-based ETL process required only approximately 133 seconds. The proposed method thus reduces the running time by 324 seconds in this case; over all experiments, the ETL processing time was reduced by approximately 53% on average. This difference would be even higher with the larger datasets we expect in the future.

Fig. 11. Comparison of the running time of existing ETL method and CDC-based ETL method.

VII. CONCLUSIONS

Applying the CDC method using HDFS and Apache Spark to the ETL model reduces the growth of running time for ETL processing in the Learning Analytics data warehouse at Universitas Indonesia. The cluster configuration with HDFS and Apache Spark, as well as the ETL model design in this study, can reduce the total ETL processing time and the number of records processed in the transformation and load stages. This makes the ETL process more efficient in terms of data processing, because it only processes new and changed data. To overcome the computation required for change detection in large datasets, distributed storage and computation servers were used in the proposed method. Moreover, this approach does not require changes to the existing systems, such as implementing database triggers. Furthermore, the data stored in the distributed file system and the change data detected by the proposed ETL method can be utilized as backup data and as an audit trail, respectively.
However, the ETL model requires a more complicated implementation, especially for the CDC method, because the approach chosen for the CDC method has to be adjusted to the structure of the data to be processed and to the conditions of the operational system. If the approach does not fit, the ETL process will not run efficiently.

This research is continuing with the acquisition of data from more applications, such as web access logs, and the frequency of data updates will be increased towards real-time access.
REFERENCES

Apache Spark (2013). Spark overview. https://spark.apache.org/docs/latest/.
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., and Zaharia, M. (2015). Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1383–1394, New York, NY, USA. ACM.
Bala, M., Boussaid, O., and Alimazighi, Z. (2016). Big-ETL: Extracting-transforming-loading approach for big data. Int'l Conf. Par. and Dist. Proc. Tech. and Appl., 8(4):50–69.
Coulouris, G., Dollimore, J., and Kindberg, T. (2012). Distributed Systems: Concepts and Design, Fifth Edition. ICSS Series. Addison-Wesley.
Jorg, T. and DeBloch, S. (2008). Towards generating ETL processes for incremental loading. In IDEAS '08: Proceedings of the 2008 International Symposium on Database Engineering & Applications, pages 101–110.
Karau, H., Konwinski, A., Wendell, P., and Zaharia, M. (2015). Learning Spark. O'Reilly Media, Inc.
Kimball, R. and Caserta, J. (2004). The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. The Data Warehouse Toolkit. Wiley Publishing, Inc.
Kimball, R. and Ross, M. (2013). The Data Warehouse Toolkit, 3rd Edition: The Definitive Guide to Dimensional Modeling. The Data Warehouse Toolkit. John Wiley & Sons, Inc.
Mekterovic, I. and Brkic, L. (2015). Delta view generation for incremental loading of large dimensions in a data warehouse. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pages 1417–1422.
Ponniah, P. (2010). Data Warehousing Fundamentals for IT Professionals, Second Edition. John Wiley & Sons, Inc.
Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010). The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10.
Tank, D. M., Ganatra, A., Kosta, Y. P., and Bhensdadia, C. K. (2010). Speeding ETL processing in data warehouses using high-performance joins for changed data capture (CDC). Pages 365–368.
White, T. (2015). Hadoop: The Definitive Guide, Fourth Edition. O'Reilly Media, Inc.
