
1/21/13

Data Change Capture Process: Full Load vs Incremental Load ClearPeaks Blog

ClearPeaks - Leaders in Business Intelligence Solutions CLEARPEAKS.COM


Data Change Capture Process: Full Load vs Incremental Load


June 29th, 2011. Jordi U.

In today's business world, quick access to data is key to making the right decisions at the right time. This article, based on a real case, highlights the importance of an incremental load, using the change data capture (CDC) technique, for loading data into the dashboards used by end users.

Scenario

The scenario is a telecommunications company whose requirement is to have dashboards showing daily, up-to-date data for the marketing and financial departments. To accomplish this, an ETL process was developed that loads the data from the different sources into the KPIs database.

The data sources are:
- Siebel CRM: contains the transactions for the services and products contracted by the customers.
- Oracle JD Edwards: contains the incoming revenue transactions derived from the services and products contracted by the customers, and the supplier payment transactions.

The dashboards are related to the:
- CRM area
- Financial area

The objectives of this solution regarding these requirements are:
- The information in these dashboards should always be up to date
- The ETL process should not take much time
- The data loaded has to reflect the data in the sources
- The ETL process should be maintainable
- The information in these dashboards should always be available

The solution: Initial full load and incremental load (CDC)

The solution applied was:
1. An initial full load of the data
2. Incremental load: applying ongoing changes and keeping historical data on a predefined schedule.

In this case the solution is to apply an incremental load instead of a full load (truncate / insert) for the following reasons:
- Information must be available 24 hours x 7 days. A truncate/insert could leave the information unavailable for a certain period of time.
- The changed data represents only 10% of the total rows. The change data capture process doesn't take much time because the changes represent only part of the data.
- It avoids moving all data every time the process is executed. The process will be faster as it doesn't need to load all data every time.
- It should manage only the changes of the last 3 months. The change data capture process doesn't take much time because it checks only part of the data and not all of it.

Moving to an incremental load strategy requires a prior analysis:

www.clearpeaks.com/blog/etl/data-change-capture-process-full-load-vs-incremental-load

Determine which changes to capture:


In this case the tables in the data sources are modified every day with respect to the previous day, so it is necessary to determine which changes the process has to capture in order to keep the data updated daily. The change to capture is in the value column of the fact table: the process compares the value of the incoming row with the value of the existing row. Find an example below:
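As a hedged illustration of this comparison (a minimal sketch, not the author's original example), the Python snippet below checks which incoming rows carry a different value than the row already stored. The key layout `(customer_id, product_id, day)`, the sample rows, and the helper name `value_changed` are all assumptions made for illustration.

```python
from decimal import Decimal


def value_changed(existing_value, incoming_value):
    """Return True when the incoming fact value differs from the stored one."""
    return existing_value != incoming_value


# Hypothetical fact rows keyed by (customer_id, product_id, day).
existing = {
    ("C001", "P10", "2011-06-01"): Decimal("120.00"),
    ("C002", "P11", "2011-06-01"): Decimal("80.50"),
}
incoming = {
    ("C001", "P10", "2011-06-01"): Decimal("135.00"),  # modified at the source
    ("C002", "P11", "2011-06-01"): Decimal("80.50"),   # unchanged
}

# Keys present on both sides whose value column differs.
changed_keys = [
    key
    for key, value in incoming.items()
    if key in existing and value_changed(existing[key], value)
]
print(changed_keys)
```

Only the first row is reported as changed, because its value differs between the existing and the incoming data.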

Design a method to identify changes:

The method here is to check whether the incoming row already exists in the database; if it does, its value is updated with the value of the incoming row. The check compares the primary key or unique key columns of the fact table with the incoming data.

Determine which changes should be updates and which should be inserts:

Based on the result of comparing the primary key or unique key columns of the rows, the process labels each row with a flag.
> If not exist -> FLAG=Insert
> If exists -> FLAG=Update

Take a look at the time stamping of the rows where you want to do the changes:

According to the requirements and the analysis of the data sources, modifications can affect at most the last 3 months of data. The process therefore checks for changes only in the data of the last 3 months, which avoids checking all of the data. The change data capture process only checks the fact table for incoming rows whose date is later than the date 3 months before the last capture date.

This picture illustrates the diagram:

This is an example of the process:
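The steps described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the ETL job used in the real case: the fact table is modelled as an in-memory dictionary, the key is assumed to be `(customer_id, day)`, the 3-month window is assumed to start on the first day of the month three months before the last capture date, and the names `months_before` and `capture_changes` are hypothetical.

```python
from datetime import date


def months_before(day, months):
    """Return the first day of the month `months` months before `day`."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def capture_changes(fact_table, incoming_rows, last_capture, window_months=3):
    """Apply incoming rows to fact_table and return the flag given to each key."""
    cutoff = months_before(last_capture, window_months)
    flags = {}
    for customer_id, day, value in incoming_rows:
        if day < cutoff:
            continue  # outside the window: the row is not checked at all
        key = (customer_id, day)
        # Compare on the key columns: existing key -> Update, new key -> Insert.
        flags[key] = "Update" if key in fact_table else "Insert"
        fact_table[key] = value
    return flags


# Hypothetical data: the target fact table after the initial full load.
fact_table = {
    ("C001", date(2011, 5, 10)): 100,
    ("C002", date(2010, 12, 1)): 50,
}
incoming_rows = [
    ("C001", date(2011, 5, 10), 120),  # key exists -> Update
    ("C003", date(2011, 6, 28), 70),   # key is new -> Insert
    ("C002", date(2010, 12, 1), 55),   # older than the window -> ignored
]

flags = capture_changes(fact_table, incoming_rows, last_capture=date(2011, 6, 28))
for key, flag in flags.items():
    print(key, flag)
```

Because rows older than the cutoff are skipped before any key comparison, the cost of each run depends on the size of the window rather than on the size of the whole fact table.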


In conclusion, incremental loading is more efficient than full reloading unless the operational data sources change dramatically, so incremental loading is generally preferable. However, the development of ETL jobs for incremental loading is poorly supported by existing ETL tools: currently, ETL programmers have to create separate ETL jobs for initial loading and incremental loading. Since incremental load jobs are considerably more complex, their development is more costly and error-prone. To overcome this obstacle in this scenario, we proposed the Change Data Capture (CDC) technique.

Posted under ETL 4 Comments

4 Responses to Data Change Capture Process: Full Load vs Incremental Load

1. Stephen Coleman says: June 29, 2011 at 6:35 pm
This is a great overview of the importance of the incremental load when considering performance costs of truncating/inserting large datasets. ETL tools such as Oracle Warehouse Builder have the ability to set table loading to insert/update that will support both full load and incremental load with the use of the same ETL routines. The key to supporting this is a created table in the staging layer to join to source tables based upon update or create dates of the record.

2. Harold Jackson says: June 29, 2011 at 9:37 pm
I'm coming into this discussion a little late, but the answer seems obvious. Why would any solution that required the complete data set to be loaded over and over even be considered? The data from both of these applications resides in Oracle databases. The first thing that I would investigate is employing Oracle's replication technology to replicate the transactional data to the staging tables in real time. Every insert, update or delete is immediately captured. The I/O load will be light and latency between systems will be low. Finally, the data will be fresher than waiting for some big process to finish uploading a whole new copy of the data.

3. Daniel says: June 30, 2011 at 11:42 am
Thanks for your comment, Stephen. We have a specialized ETL team and we have applied the CDC technique for several customers using different tools.

4. Daniel says: June 30, 2011 at 2:36 pm
Thanks for your comment, Harold. One way to achieve this could be the Oracle GoldenGate tool. It provides high-speed data replication between heterogeneous platforms and allows you to capture and deliver real-time change data to data warehouses.
