Andy Leonard
SSIS and ETL
Thoughts about Database and Software Development, and the tools of the trade.

SSIS Design Pattern - Incremental Loads

Introduction
Loading data from a data source to SQL Server is a common task. It's used in Data Warehousing, but increasingly data is being staged in SQL Server for non-Business-Intelligence purposes.
Maintaining data integrity is key when loading data into any database. A common way of accomplishing this is to truncate the destination and reload from the source. While this method ensures data integrity, it also loads data that was just deleted.

Incremental loads are faster and use less server resources. Only new or updated data is touched in an incremental load.
When To Use Incremental Loads

Use incremental loads whenever you need to load data from a data source to SQL Server.

Incremental loads are the same regardless of which database platform or ETL tool you use: you need to identify the new and updated rows, and separate them from the unchanged rows.
Incremental Loads in Transact-SQL

I will start by demonstrating this with T-SQL:
 
0. (Optional, but recommended) Create two databases: a source and a destination database for this demo:

CREATE DATABASE [SSISIncrementalLoad_Source]
CREATE DATABASE [SSISIncrementalLoad_Dest]

1. Create a source table named tblSource with the columns ColID, ColA, ColB, and ColC:

USE SSISIncrementalLoad_Source
GO
CREATE TABLE dbo.tblSource
(ColID int NOT NULL
,ColA varchar(10) NULL
,ColB datetime NULL constraint df_ColB default (getDate())
,ColC int NULL
,constraint PK_tblSource primary key clustered (ColID))
2. Create a destination table named tblDest with the columns ColID, ColA, ColB, and ColC:

USE SSISIncrementalLoad_Dest
GO
CREATE TABLE dbo.tblDest
(ColID int NOT NULL
,ColA varchar(10) NULL
,ColB datetime NULL
,ColC int NULL)

3. Let's load some test data into both tables for demonstration purposes:

USE SSISIncrementalLoad_Source
GO

-- insert an "unchanged" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(1, 'B', '1/1/2007 12:02 AM', -2)

-- insert a "new" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(2, 'N', '1/1/2007 12:03 AM', -3)

USE SSISIncrementalLoad_Dest
GO

-- insert an "unchanged" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(1, 'C', '1/1/2007 12:02 AM', -2)

4. You can view new rows with the following query:

SELECT s.ColID, s.ColA, s.ColB, s.ColC
FROM SSISIncrementalLoad_Source.dbo.tblSource s
LEFT JOIN SSISIncrementalLoad_Dest.dbo.tblDest d ON d.ColID = s.ColID
WHERE d.ColID IS NULL

This should return the "new" row - the one loaded earlier with ColID = 2 and ColA = 'N'. Why? The LEFT JOIN and WHERE clauses are the key. Left Joins return all rows on the left side of the join (SSISIncrementalLoad_Source.dbo.tblSource in this case) whether there's a match on the right side of the join (SSISIncrementalLoad_Dest.dbo.tblDest in this case) or not. If there is no match on the right side, NULLs are returned. This is why the WHERE clause works: it goes after rows where the destination ColID is NULL. These rows have no match in the LEFT JOIN, therefore they must be new.
This is only an example. You occasionally find database schemas that are this easy to load. More often you have to include several columns in the JOIN ON clause to isolate truly new rows. Sometimes you have to add conditions in the WHERE clause to refine the definition of truly new rows.
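For instance, when the key is a composite of more than one column, every key column participates in both the JOIN ON clause and the IS NULL test. A minimal sketch of that variation - the OrderLines tables and their OrderID, LineID, Amount, and OrderStatus columns are hypothetical, not part of this demo:

-- Hypothetical composite-key example (not part of the demo schema above)
SELECT s.OrderID, s.LineID, s.Amount
FROM SourceDB.dbo.OrderLines s
LEFT JOIN DestDB.dbo.OrderLines d
  ON d.OrderID = s.OrderID
 AND d.LineID = s.LineID              -- every key column joins
WHERE d.OrderID IS NULL               -- no match on the full key: the row is new
  AND s.OrderStatus <> 'Cancelled'    -- an extra WHERE condition refining "truly new"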
 Geek
Incrementally load the row ("rows" in practice) with the following T-SQL statement:

INSERT INTO SSISIncrementalLoad_Dest.dbo.tblDest
(ColID, ColA, ColB, ColC)
SELECT s.ColID, s.ColA, s.ColB, s.ColC
FROM SSISIncrementalLoad_Source.dbo.tblSource s
LEFT JOIN SSISIncrementalLoad_Dest.dbo.tblDest d ON d.ColID = s.ColID
WHERE d.ColID IS NULL

5. There are many ways by which people try to isolate changed rows. The only sure-fire way to accomplish it is to compare each field. View changed rows with the following T-SQL statement:

SELECT d.ColID, d.ColA, d.ColB, d.ColC
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
INNER JOIN SSISIncrementalLoad_Source.dbo.tblSource s ON s.ColID = d.ColID
WHERE (
 (d.ColA != s.ColA)
 OR (d.ColB != s.ColB)
 OR (d.ColC != s.ColC)
)
This should return the "changed" row we loaded earlier with ColID = 1 and ColA = 'C'. Why? The INNER JOIN and WHERE clauses are to blame - again. The INNER JOIN goes after rows with matching ColID's because of the JOIN ON clause. The WHERE clause refines the resultset, returning only rows where the ColA's, ColB's, or ColC's don't match and the ColID's match. This is important. If there's a difference in any of the compared columns, we want to update it.

Extract-Transform-Load (ETL) theory has a lot to say about when and how to update changed data. You will want to pick up a good book on the topic to learn more about the variations.

To update the data in our destination, use the following T-SQL:

UPDATE d
SET
 d.ColA = s.ColA
,d.ColB = s.ColB
,d.ColC = s.ColC
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
INNER JOIN SSISIncrementalLoad_Source.dbo.tblSource s ON s.ColID = d.ColID
WHERE (
 (d.ColA != s.ColA)
 OR (d.ColB != s.ColB)
 OR (d.ColC != s.ColC)
)
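For completeness: on SQL Server 2008 and later, the insert and update above can be expressed in a single MERGE statement. This is not part of the original 2007 pattern - treat it as a minimal sketch of the same logic:

-- Sketch only: combines the new-row insert and changed-row update (SQL Server 2008+)
MERGE SSISIncrementalLoad_Dest.dbo.tblDest AS d
USING SSISIncrementalLoad_Source.dbo.tblSource AS s
    ON d.ColID = s.ColID
WHEN NOT MATCHED BY TARGET THEN              -- new rows
    INSERT (ColID, ColA, ColB, ColC)
    VALUES (s.ColID, s.ColA, s.ColB, s.ColC)
WHEN MATCHED AND (d.ColA <> s.ColA
               OR d.ColB <> s.ColB
               OR d.ColC <> s.ColC) THEN     -- changed rows only
    UPDATE SET d.ColA = s.ColA
              ,d.ColB = s.ColB
              ,d.ColC = s.ColC;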
Incremental Loads in SSIS

Let's take a look at how you can accomplish this in SSIS using the Lookup Transformation (for the join functionality) combined with the Conditional Split (for the WHERE clause conditions) transformations.
Before we begin, let's reset our database tables to their original state using the following query:

USE SSISIncrementalLoad_Source
GO

TRUNCATE TABLE dbo.tblSource

-- insert an "unchanged" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(1, 'B', '1/1/2007 12:02 AM', -2)

-- insert a "new" row
INSERT INTO dbo.tblSource
(ColID,ColA,ColB,ColC)
VALUES(2, 'N', '1/1/2007 12:03 AM', -3)

USE SSISIncrementalLoad_Dest
GO

TRUNCATE TABLE dbo.tblDest

-- insert an "unchanged" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(0, 'A', '1/1/2007 12:01 AM', -1)

-- insert a "changed" row
INSERT INTO dbo.tblDest
(ColID,ColA,ColB,ColC)
VALUES(1, 'C', '1/1/2007 12:02 AM', -2)
Next, create a new project using Business Intelligence Development Studio (BIDS). Name the project SSISIncrementalLoad:

Once the project loads, open Solution Explorer and rename Package1.dtsx to SSISIncrementalLoad.dtsx. When prompted to rename the package object, click the Yes button. From the toolbox, drag a Data Flow task onto the Control Flow canvas:
Double-click the Data Flow task to edit it. From the toolbox, drag and drop an OLE DB Source onto the canvas:

Double-click the OLE DB Source connection adapter to edit it:


Click the New button beside the OLE DB Connection Manager dropdown:
Click the New button here to create a new Data Connection:
Enter or select your server name. Connect to the SSISIncrementalLoad_Source database you created earlier. Click the OK button to return to the Connection Manager configuration dialog. Click the OK button to accept your new Data Connection as the Connection Manager you wish to define. Select "dbo.tblSource" from the Table dropdown. Click the OK button to complete defining the OLE DB Source Adapter.

Drag and drop a Lookup Transformation from the toolbox onto the Data Flow canvas. Connect the OLE DB Source adapter to the Lookup transformation by clicking on the OLE DB Source and dragging the green arrow over the Lookup and dropping it. Right-click the Lookup transformation and click Edit (or double-click the Lookup transformation to edit it). When the editor opens, click the New button beside the OLE DB Connection Manager dropdown (as you did for the OLE DB Source Adapter). Define a new Data Connection - this time to the SSISIncrementalLoad_Dest database. After setting up the new Data Connection and Connection Manager, configure the Lookup transformation to connect to "dbo.tblDest":

Click the Columns tab. On the left side are the columns currently in the SSIS data flow pipeline (from SSISIncrementalLoad_Source.dbo.tblSource). On the right side are the columns available from the Lookup destination you just configured (from SSISIncrementalLoad_Dest.dbo.tblDest). Follow these steps:

1. We'll need all the rows returned from the destination table, so check all the checkboxes beside the rows in the destination. We need these rows for our WHERE clauses and for our JOIN ON clauses.

2. We do not want to map all the rows between the source and destination - we only want to map the column ColID between the database tables. The Mappings drawn between the Available Input Columns and Available Lookup Columns define the JOIN ON clause. Multi-select the Mappings between ColA, ColB, and ColC by clicking them while holding the Ctrl key. Right-click any of them and click "Delete Selected Mappings" to delete these columns from the JOIN ON clause.

3. Add the text "Dest_" to each column's Output Alias. These rows are being appended to the data flow pipeline, so we can distinguish between Source and Destination rows farther down the pipeline:
Next we need to modify our Lookup transformation behavior. By default, the Lookup operates as an INNER JOIN - but we need a LEFT (OUTER) JOIN. Click the "Configure Error Output" button to access the Error Output page. On the "Lookup Output" row, change the Error column from "Fail component" to "Ignore failure". This tells the Lookup transformation "If you don't find an INNER JOIN match in the destination table for the Source table's ColID value, don't fail." - which also effectively tells the Lookup "Don't act like an INNER JOIN, behave like a LEFT JOIN":
Click OK to complete the Lookup transformation configuration.

From the toolbox, drag and drop a Conditional Split Transformation onto the Data Flow canvas. Connect the Lookup to the Conditional Split as shown. Right-click the Conditional Split and click Edit to open the Conditional Split Transformation Editor. Expand the NULL Functions folder in the upper right of the Conditional Split Transformation Editor. Expand the Columns folder in the upper left side of the Conditional Split Transformation Editor. Click in the "Output Name" column and enter "New Rows" as the name of the first output. From the NULL Functions folder, drag and drop the "ISNULL( <<expression>> )" function to the Condition column of the New Rows condition:

Next, drag Dest_ColID from the Columns folder and drop it onto the "<<expression>>" text in the Condition column. "New Rows" should now be defined by the condition "ISNULL( [Dest_ColID] )". This defines the WHERE clause for detecting new rows - setting it to "WHERE Dest_ColID Is NULL".

Type "Changed Rows" into a second Output Name column. Add the expression "(ColA != Dest_ColA) || (ColB != Dest_ColB) || (ColC != Dest_ColC)" to the Condition column for the Changed Rows output. This defines the WHERE clause for detecting changed rows - setting it to "WHERE ((Dest_ColA != ColA) OR (Dest_ColB != ColB) OR (Dest_ColC != ColC))". Note "||" is used to convey "OR" in SSIS Expressions:

Change the "Default output name" from "Conditional Split Default Output" to "Unchanged Rows":
Click the OK button to complete configuration of the Conditional Split transformation.

Drag and drop an OLE DB Destination connection adapter and an OLE DB Command transformation onto the Data Flow canvas. Click on the Conditional Split and connect it to the OLE DB Destination. A dialog will display prompting you to select a Conditional Split Output (those outputs you defined in the last step). Select the New Rows output.
Next connect the OLE DB Command transformation to the Conditional Split's "Changed Rows" output:

 Your Data Flow canvas should appear similar to the following:


Configure the OLE DB Destination by aiming at the SSISIncrementalLoad_Dest.dbo.tblDest table:
Click the Mappings item in the list to the left. Make sure the ColID, ColA, ColB, and ColC source columns are mapped to their matching destination columns (aren't you glad we prepended "Dest_" to the destination columns?):
 

Click the OK button to complete configuring the OLE DB Destination connection adapter.

Double-click the OLE DB Command to open the "Advanced Editor for OLE DB Command" dialog. Set the Connection Manager column to your SSISIncrementalLoad_Dest connection manager:
 

Click on the "Component Properties" tab. Click the elipsis (button with "...")
 The String Value Editor displays. Enter the following parameterized T-SQL statement into the String Va

UPDATE dbo.tblDest
SET
ColA = ?
,ColB = ?
,ColC = ?
WHERE ColID = ?

The question marks in the previous parameterized T-SQL statement map by ordinal to columns named "Param_0" through "Param_3". Map them as shown below - effectively altering the UPDATE statement to read:

UPDATE SSISIncrementalLoad_Dest.dbo.tblDest
SET
ColA = SSISIncrementalLoad_Source.dbo.ColA
,ColB = SSISIncrementalLoad_Source.dbo.ColB
,ColC = SSISIncrementalLoad_Source.dbo.ColC
WHERE ColID = SSISIncrementalLoad_Source.dbo.ColID

Note the query is executed on a row-by-row basis. For performance with larger data volumes, you will want to employ set-based updates instead. Click the OK button when mapping is completed.
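One common way to implement the set-based alternative mentioned above is to land the "Changed Rows" output in a staging table (with an OLE DB Destination instead of the OLE DB Command) and then run a single joined UPDATE in an Execute SQL Task after the Data Flow. A minimal sketch, assuming a hypothetical staging table dbo.stgTblDest_Changed with the same four columns:

-- Run once per load, after the Data Flow has filled the hypothetical staging table
UPDATE d
SET d.ColA = s.ColA
   ,d.ColB = s.ColB
   ,d.ColC = s.ColC
FROM SSISIncrementalLoad_Dest.dbo.tblDest d
INNER JOIN SSISIncrementalLoad_Dest.dbo.stgTblDest_Changed s
    ON s.ColID = d.ColID;

-- Reset the stage for the next run
TRUNCATE TABLE SSISIncrementalLoad_Dest.dbo.stgTblDest_Changed;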

Your Data Flow canvas should look like that pictured below:
If you execute the package with debugging (press F5), the package should succeed and appear as shown below. Note one row takes the "New Rows" output from the Conditional Split, and one row takes the "Changed Rows" output from the Conditional Split transformation. Although not visible, our third source row doesn't change, and is sent to the "Unchanged Rows" output - which is simply the default Conditional Split output renamed. Any row that doesn't meet any of the predefined conditions in the Conditional Split is sent to the default output.
That's all! Congratulations - you've built an incremental database load! [:)]

Get the code! (Free registration required)

:{> Andy
Published Monday, July 09, 2007 3:13 PM by andyleonard
Filed under: Design Pattern, Incremental, SSIS


Comments

Jason Haley said:

July 10, 2007 9:09 AM

Jason Haley said:

July 10, 2007 9:10 AM

Alberto Ferrari said:

Andy, maybe you are interested in taking a look at the TableDifference


component I published at http://www.sqlbi.eu.

It is an all-in-one and completely free SSIS component that handles these kinds of situations without the need to cache data in the Lookup. Lookups are nice but - in real situations - they may quickly lead to out-of-memory situations (think of a hundred-million-row table... it simply cannot be cached in memory).

Beware that - for huge table comparison - you will need both TableDifference
AND the FlowSync component that you can find at the same site.

I'll be glad to hear your comments about it.

Alberto

July 12, 2007 5:21 AM


andyleonard said:

Thanks Alberto! Checking it out now.

:{> Andy

July 13, 2007 9:30 PM


David R Buckingham said:

Thank you greatly Andy.  This couldn't have come at a better time as I just
started using Integration Services for the first time on Friday to handle eight
different data loads (all for a single client).  Four of the data loads are straight
appends, but the other four are incremental.

This approach is vastly superior to loading the incremental data into a


temporary table and then processing it against the destination table.  In fact, it
proved to be more efficient than both set-based insert/updates or a cursor-based
approach.  Yes, I tested both approaches prior to implementing yours.  Your
approach was faster than the set-based insert/updates even though I tested it
across the WAN, which surprised me greatly.

I also created a script to assist with the creation of the Conditional Split
"Changed Rows" condition which follows (be sure your results aren't being
truncated when you have a table with many columns):

--- BEGIN SCRIPT ---

DECLARE @Filter varchar(max)

SET @Filter = ''

-- ((ISNULL(<ColumnName>)?"":<ColumnName>)!=(ISNULL(Dest_<ColumnName>)?"":Dest_<ColumnName>)) ||

SELECT @Filter = @Filter + '((ISNULL(' + c.[name] + ')?"":' + c.[name] + ')!=(ISNULL(Dest_' + c.[name] + ')?"":Dest_' + c.[name] + ')) || '
FROM sys.tables t
INNER JOIN sys.columns c
  ON t.[object_id] = c.[object_id]
WHERE SCHEMA_NAME( t.[schema_id] ) = 'GroupHealth'
  AND t.[name] = 'ConsumerDetail'
  AND c.[is_identity] = 0
  AND c.[is_rowguidcol] = 0
ORDER BY c.[column_id]

SET @Filter = LEFT( @Filter, LEN( @Filter ) - 2 )

SELECT @Filter

--- END SCRIPT ---

Again, thanks greatly.  I now have 2 SSIS books on their way to me.  I am
eager to learn as much as I can.

July 17, 2007 3:52 PM


Bill Mo said:

Hello, Andy! Thanks a lot for your incremental process! I'm doing an SSIS project!

July 17, 2007 9:47 PM

david boston said:

Thanks this worked a treat for my SSIS project.

July 20, 2007 5:01 AM

andyleonard said:

Hi David, Bill, and David,

  Thanks for the feedback!

:{> Andy

August 8, 2007 7:14 PM


saul said:

Hi Andy !!  Great work... I was scared because of this Incremental load... and
you saved my weekend... now I can enjoy it .... :-)

September 7, 2007 5:56 PM


Steve Hall said:

Anyone had a problem with the insert and update commands locking each other
out?

Didn't happen at first but does now.  Update gets blocked by the insert and it
just hangs.
Steve

September 18, 2007 1:18 PM


andyleonard said:

Thanks Saul!

Steve, are you sure there's not something more happening on the server that's
causing this?

If this is repeatable, please provide more information and I'll be happy to take a
look at it.

SQL Server does a fair job of detecting and managing deadlocks when they
occur. I haven't personally seen SQL Server "hang" since 1998 - and then it
was due to a failing I/O controller.

:{> Andy

September 27, 2007 6:57 PM


Bill Mo said:

Hi, Andy! I have the same problem as Steve - it's blocking. When the bulk insert and update happen, the update gets blocked by the insert and it just hangs! The insert's wait type is ASYNC_NETWORK_IO.

October 8, 2007 4:15 AM


Bobby said:

Thx 4 the trick with Fail -> Left Join ! I was thinking how to do it whole day
:o)

October 18, 2007 1:23 AM


Andy Leonard said:

Introduction This post is part of a series of posts on ETL Instrumentation. In


Part 1 we built a database

November 18, 2007 10:53 PM


Michael Ross said:

Steve,

This most certainly can be the case with larger datasets.  In my case, I ran into
this issue with large FACT table loads.  Either consider dumping the contents
of the insert into a temp table or SSIS RAW datafile and complete the insert in
a separate dataflow task or modify the isolationlevel of the package.  Be
warned, make sure you research the IsolationLevel property thoroughly before
making such a change.
November 26, 2007 12:03 PM
Michael said:

What happens when a field is NULL in the destination or source when


determining changed rows? Don't we need special checks to ensure if a
destination field is NULL the source should also be? Thus a change has
occurred and the record should be updated?

December 26, 2007 10:26 AM


andyleonard said:

Hi Michael,

  Excellent question! This post was intended to cover the principles of


Incremental Loads, and not as a demonstration of production-ready code.
</CheesyExcuse>

  There are a couple approaches to handling NULLs in the source or


destination, each with advantages and disadvantages. In my opinion, the chief
consideration is data integrity and the next-to-chief consideration is metadata
integrity.

  A good NULL trap can be tricky because NULL == NULL should never
evaluate to True. I know NULL == NULL can evaluate to True with certain
settings, but these settings also have side-effects. And then there's maintenance
to consider... basically, there's no free lunch.

  A relatively straightforward method involves identifying a value for the field


that the field will never contain (i.e. -1, "(empty)", or even the string "NULL")
and using that value as a substitute for NULL. In the SSIS expression language
you can write a change-detection expression like:

(ISNULL(Dest_ColA) ? -1 : Dest_ColA) != (ISNULL(ColA) ? -1 : ColA)

  But again, if ColA is ever -1 this will evaluate as a change and fire an update.
Why does this matter? Some systems include "number of updated rows" as a
validation metric.

:{> Andy

December 26, 2007 12:50 PM


Michael said:

Hi Andy,

Thanks for this great article!

Do you have any hints for implementing your design with an Oracle Source. I
am attempting to incrementally update from a table with 7 million rows with
~50 fields. The Lookup Task failed when I attempted to use it like you
described above due to a Duplicate Key error...cache is full. I googled this and
found an article suggesting enabling restrictions and enabling smaller cache
amounts. However it is now extremely slow. Do you have any
experience/advice on tweaking the lookup task for my environment?

Is there value in attempting to port this solution to an Oracle to SQL


environment?

Is there a way to speed things up/replace the lookup task by using a SQL
Execution Task which calls a left outer join?

Is there major difference\impact in having multiple primary keys?

Thanks Again

December 26, 2007 1:47 PM


Andy Leonard said:

Now that our 5-month old son - Riley Cooper - is on the mend , I am hitting the
speaking trail again!

January 6, 2008 6:16 PM


Jigs said:

Hi Andy, this looks great and works great too, but if there are more records to update then it just hangs while doing the insert and update. What should I do? Is there any workaround by which we can avoid the SSIS package hanging? Please suggest.

Thanks

Jigu

January 15, 2008 3:36 PM


andyleonard said:

Hi Bill and Jigu,

Although I mention set-based updates here I did not demonstrate the principle
because I felt the post was already too long - my apologies.

I have since written more on Design Patterns. Part 3 of my series on ETL Instrumentation (http://sqlblog.com/blogs/andy_leonard/archive/2007/11/18/ssis-design-pattern-etl-instrumentation-part-3.aspx#SetBasedUpdates) demonstrates set-based updates.

I need to dedicate a post to set-based updates.

:{> Andy

January 16, 2008 7:10 AM


Jai said:

Hi Andy

Thanks, you were a great help in understanding data updates through an SSIS package.

April 5, 2008 6:16 PM


Kenneth said:

Hi Andy,

I have a hard time following your instructions. Can you send me your sample
project

Thank You

Kenneth

aspken@msn.com

July 29, 2008 1:44 PM


andyleonard said:

Hi Kenneth,

  Sorry to hear you're having a hard time with my instructions.

  One of the last instructions is a link at the bottom of the page called "Get the
code". It points to this URL:
http://vsteamsystemcentral.com/dnn/Demos/IncrementalLoads/tabid/94/Default
.aspx.

Hope this helps,

Andy

July 29, 2008 1:59 PM


EAD said:

Not sure if I posted the same question in a few places… Maybe you gurus can explain.

In SSIS, the Fuzzy Grouping object creates some temp tables and does the fuzzy logic. I ran a trace to see how it does this; in one cursor it is taking a very long time to process 150,000 records. The same executes fine in any other test environment. The cursor is simple and I can post it if needed. Any thoughts?

September 11, 2008 8:45 PM


LNelson said:

I have a similar package I am trying to create and this was a big help.  The new
rows write properly however I am getting an error on the changed rows because
the SQL table i am writing to has an auto incremented identity spec column.
The changes won't write to the SQL table.  If I uncheck "keep identity" it writes
new rows instead of updating existing.  What am I missing?

December 1, 2008 11:38 AM


FDA said:

Thanks a lot, Andy!! Very helpful!

December 17, 2008 3:48 AM

Rajesh said:

Hi Andy..

  That's a good alternative to the slowly changing dimension...!!

  Well done...

  What if the incremental load is based on more than one column...?

  And further, to increase the complications, what if any of the columns included in the lookup condition changes as well....?

  Last one... what if the row is deleted from the source....?

January 6, 2009 3:23 AM


Ken (aspken@msn.com) said:

It looks like your package handles new and updated rows.

I don't see the code handling rows deleted in the source (assuming there are any).

Here is my two cents.

In your lookup, you can split out the match and non-match rows.

Non-match means a new record, and you can do an insert directly after the lookup. You can eliminate the 'new row' condition in your 'conditional split'.

However, overall, your sample package is the best sample on the net (as far as I have searched) - I love it, honestly.

Keep up the great work and keep giving out sample packages.

Like most people, I do appreciate your effort.

Ken

January 7, 2009 8:10 PM


andyleonard said:

Hi Ken,

  Thanks for your kind words.

  I believe you're referring to functionality new to the SSIS 2008 Lookup


Transformation - there is no Non-Match Rows output buffer in the SSIS 2005
Lookup Transformation.

:{> Andy

January 7, 2009 9:58 PM


RVS said:

Hi Andy,

Thanks a lot for this article. It proved to be a great help for me.

I was wondering if you can provide some solution to handle deleted rows from
source table using lookup. I need this because I have to keep the historical data
in the data warehouse.

Thanks in advance,

RVS

ranvijay.sahay@gmail.com

January 21, 2009 3:04 AM


Charlie Asbornsen said:

Andy, thanks for your help and effort.  This is definitely more elegant than
staging over to one database and then doing ExecuteSQLs to execute
incremental loads.

January 21, 2009 5:16 PM


Charlie Asbornsen said:

And re ranvijay's question, I would assume that when the row exists in the
destination but not the source, the source RowID would show up as null, so you
could do that as another split on the conditional.

January 21, 2009 5:18 PM


andyleonard said:

Hi RVS and Charlie,

  RVS, Charlie answered your question before I could get to it! I love this
community!

  I need to write more on this very topic. New features in SQL Server 2008
change this and make the Deletes as simple as New and Updated rows.

  I didn't mention Deletes in this post because the main focus was to get folks
thinking about leveraging the data flow instead of T-SQL-based solutions
(Charlie, in regards to your first comment). There's nothing wrong with T-SQL.
But a data flow is built to buffer (or "paginate") rows. It bites off small chunks,
acts on them, and then takes another bite. This greatly reduces the need to swap
to disk - and we all know the impact of disk I/O on SQL Server performance.

  Charlie is correct. The way to do Deletes is to swap the Source and


Destinations in the Correlate / Filter stages.

  Typically, I stage Deletes and Updates in a staging table near the table to be
Deleted / Updated. Immediately after the data flow, I add an Execute SQL Task
to perform a correlated (inner joined) update or delete with the target table. I do
this because my simplest option inside a data flow is row-based Updates /
Deletes using the OLE DB Command transformation. A set-based Update /
Delete is a lot faster.

  I need to write more about that as well...

:{> Andy

January 21, 2009 5:29 PM


Charlie Asbornsen said:

Andy,

Looks like I have some rewriting to do on the next version of the ETL.  It's a
good thing I enjoy working in SSIS!

I'm working on building a data warehouse and BI solution for a government


customer, and a lot of their 1970's era upstream data sources don't have ANY
kind of data validation.  In fact when we first installed in production we found
out that they had some code fields in their data tables with a single quote for
data!  It played merry hob with our insert statements until we figured out what
was happening. Then I got to figure out how to do D-SQL whitelisting with VB
scripting in SSIS :)

Of course since it's the government we'll probably have to wait until 3Q 2010
before we're allowed to upgrade to SQL 2008.  We were all gung ho about VS
2008 (which we were allowed to get) but imagine my chagrin when I found out
that I couldn't use my beloved BI Studio without SQL 2008... :P  So I'll be
using this for the next version... and possibly the version after that as well.
Thanks a bunch!

January 21, 2009 5:41 PM


Charlie Asbornsen said:

Me again.

I think I made a mistake.  If a row already exists in the destination table and it
no longer exists in the source table, I want it deleted (sent to the deletes staging
table).  However, the lookup limits the row set in memory to items that are
already in the source table, so its not really functioning as an outer join.  Its
perfect for determining inserts and updates, but I need to do something else to
do deletes...

I'm going to try adding an additional OLE DB source and point that at the same
table the lookup is checking... hmm, maybe try the Merge?  I'll see what
happens and let you know.

January 22, 2009 12:41 PM


Charlie Asbornsen said:

Actually I think I need a second pass... grrr.

January 22, 2009 12:44 PM

Charles Asbornsen said:

Andy,

Please feel free to combine this with the previous reply.

What I wound up doing was creating a second data flow after the one that split
the inserts and updates out.  The deletes flow populated a deleted rows staging
table with the deleted row id, which then was joined to the ultimate destination
table in a delete command in an Execute SQL task.  I wound up reversing the
lookup, but used the same technique by using a conditional split on whether or
not the new column from the lookup was null, and if it was, the output went to
the "deleted records" path, which populated the staging table.

The reason I want to actually remove the data from the table as opposed to
merely marking it as deleted is because the reason a row would disappear
would be because it was a bad reference code in the first place.  My big
datawarehouse ETL adds new reference codes to the reference tables (which it
needs to create in the first place because the source reference codes are held in
these five gigantic tables which do not lend themselves to generating NV lists)
for unmatched codes in the data tables (remember there's no validation at the
source).  

When the reconciliation stick finally gets swung and the customer replaces the
junk code it disappears from my ETL and I remove it from my table.  It is
different from a code that gets obsoleted; there's a reason to track those, but
garbage just needs to be thrown out.

Thanks again, I would have been very annoyed with myself if I wound up
doing row-based IUDs...

January 22, 2009 2:55 PM


andyleonard said:

Hi Charles,

  I wasn't clear in my earlier response but you figured it out anyway - apologies
and kudos. You do need to do the Delete in another Data Flow Task.

  Excellent work!

:{> Andy

January 22, 2009 4:15 PM


Charles Asbornsen said:

Andy,

Is there a limit to how many comparisons you can make in the Conditional Split
Transformation Editor?  I have a table with 20 columns, and I'm trying to do 19
comparisons.  It's telling me that one of the columns doesn't exist in the input
column collection.  I can cut the expression and paste it back in and it picks a
different column to complain about.  Error 0xC0010009... it says the expression
cannot be parsed, may contain invalid elements or might not be well formed,
and there may also be an out-of-memory error.

I've been looking at it for 1/2 an hour and all the columns it is variously
complaining about are present in the input column collection, so I suspect it's a
memory error.  Should I alias the column names to be shorter (ie the problem is
in the text box) or is it a metadata problem?  I'm going home now but tomorrow
I will see if splitting the staging table into 4 tables and splitting the conditions
into 4 outputs (to be recombined later by an execute SQL command into the
real staging table) does what I need.

Thanks!

Charlie

January 22, 2009 5:54 PM


RVS said:

Hi Andy and Charles,

I thank you for your comments. I still have a few doubts related to handling
Deleted columns. I have created a solution to handle all three cases(add,update
and delete). I have taken two OLEDB Source(one with source and data and
another with destination table's data) then I have SORTED them and MERGED
them(with FULL OUTER join) and finally used CONDITIONAL SPLIT to
filter New, updated and Deleted data and used the OLEDB Command to do the
required action. I am getting Deleted rows by using full outer join.

I am getting expected result with this solution but I think this is not
performance efficient as it is using sort, merge etc. I wanted to use Lookup as
suggested by Andy. But the solution which you both have given is not fully
clear to me. Will it be possible for you to send me a sketch of the proposed
solution or explain it a bit in detail?

Charles, regarding no. of comparisons, I don't think it is limited to 19 or 20


because I have used more than 35 comparisons and that is working fine. Please
check if you have checked for null columns correctly.

Thanks once again,

RVS

(ranvijay.sahay@gmail.com)

January 23, 2009 6:57 AM


Charlie Asbornsen said:

Doh! Thanks Ranvijay.

January 23, 2009 10:01 AM

Charlie Asbornsen said:

Actually what was happening was that since the comparison expression was so
long I moved it into WordPad to type it and then copy/pasted into the rather
annoyingly non-resizable condition field in the conditional split transformation
editor.  It turns out it doesn't like that.  Maybe there were invisible control
characters in the string, so I needed to just bite the bullet and type in the
textbox.  It works fine now.

It would be nice to have a text visualizer for that field.

Thanks!

January 23, 2009 1:51 PM


vidhya said:

This was the excellent article and Andy illustration style is great.

Thank you

June 30, 2009 9:47 AM


Nostromo said:

Great tutorial!  I'm new to SSIS and I worked through it without a hitch.

Thanks!!!

July 10, 2009 10:23 AM


DVL said:

Hello,

Many thanks for the step by step guide.

It's nice to find a way to get your changed and new records in 2 separate outputs. But how would you get the deleted records? The only solution I found is to look up every PK in the source db table and check if it still exists; if it does not, it sets the deleted_flag to 1. Do you have any idea how to implement the deleted records into your solution? Mine is in a separate dataflow.

Greetings  

August 27, 2009 8:05 AM


CSu said:

Great article! I originally used sort, merge join (with left outer join) and
conditional split transforms to perform incremental load. Unfortunately it did
not work as expected. Your article has simplified my design and it is now
working perfectly. Thanks for sharing. :)

October 26, 2009 7:26 AM


hasan said:

Dear Andy

Your solution is great but I have a problem: the dimensions are not getting populated with the default data. Does this work with an Excel source? Because I have an Excel source.

December 29, 2009 7:31 AM


Mike said:

Hiya,

Just read the article, confirms my approach to incremental loading on a series


of smallish facts.

I have used the "slowly changing dimension" element in the past to facilitate
the same outcome, ie not using type2s (despite being a fact) - but it is much
slower.

RVS, re: "I am getting expected result with this solution but I think this is not
performance efficient as it is using sort, merge etc"; if the sort(s) are the main
problem, you can do the sort on the database and tell SSIS that the set is sorted
to avoid using two sort dataflow tasks - not sure if that will give you sufficient
gains? The Merge join, as you say, will still be not great within SSIS.

Lastly - has anyone any experience of duplicated KEYS in the source table, that
do not (yet) exist in the destination?

I am performing bulk-inserts after the update/insert evaluation. I have a minor


concern that if I have a key in the source data, that the FIRST record will
correctly INSERT, does the lookup then add this key to memory, so that when
the second key arrives it knows to update?

Because, although I do not constrain the destination table, it will cause


problems within the data (mini carteseans - *shudder*).

Do I need to be aware of any settings or the like? I am about to do a test-case


now - and see what happens...

January 24, 2010 5:35 PM


Mandar said:

Hi Andy,

I want to load data incrementally from source (MySQL 5.2) to SQL Server
2008, using SSIS 2008, based on modified date. Somehow I am not able to do it, as MySQL doesn't support parameters. Need some help on this.

-regards, mandar

March 15, 2010 6:40 AM


Ramdas said:

Thank you andy for this tutorial. I am using SSIS 2008, the Lookup task
interface has changed a little bit, when you click on edit on the lookup task, the
opening screen is layed out differently.

March 25, 2010 9:46 AM


KK said:

In my source, the record with ID 3 has duplicated keys, so I want the first record inserted and the second record updated in the destination table through SSIS.

Can anyone help me to resolve this problem?

When I use SCD type 2, when it reads the record the ID 3 row is not available in the target, so it is treated as an insert for the second record as well.

So that record is inserted two times. I don't want that - I want the first record of ID 3 inserted and the second record updated.

Is there any way to resolve this problem?

ID       Name     Date  

1 Kiran 1/1/2010 12:00:00 AM

3 Rama 1/2/2010 12:00:00 AM

2 Dubai 1/2/2010 12:00:00 AM

3 Ramkumar 1/2/2010 12:00:00 AM

March 25, 2010 5:11 PM


Craig said:

I need to incrementally load data from Sybase to SQL.  There will be several
hundred million rows.  Will this approach work OK with this scenario?

March 30, 2010 10:45 AM


andyleonard said:

Hi Craig,

  Maybe, but most likely not. This is one design pattern you can start with. I
would test this, tweak it, and optimize like crazy to get as much performance
out of your server as possible.

:{> Andy

March 30, 2010 10:52 AM


jpedroalmeida said:

Hy there from Portugal,

Andy, i am a starter in SSIS and i found this article very useful and
straightforward in explanation with text and images...

Thanks a lot!!

Cheers

April 25, 2010 11:02 AM


JohnnyReaction said:

Hi Andy

I amended your script to deal with different datatypes (saves a lot of debugging
in the Conditional Split Transformation Editor):

/*

This script assists with the creation of the Conditional Split "Changed Rows"
condition

-- be sure your results aren't being truncated when you have a table with many
columns

*/

--- BEGIN SCRIPT ---

USE master
GO

DECLARE @Filter varchar(max)

SET @Filter = ''

SELECT @Filter = @Filter + '((ISNULL(' + c.[name] + ')?' +
CASE WHEN c.system_type_id IN (35,104,167,175,231,239,241) THEN '""'
     WHEN c.system_type_id IN (58,61) THEN '(DT_DBTIMESTAMP)"1900-01-01"'
     ELSE '0' END
+ ':' + c.[name] + ')!=(ISNULL(Dest_' + c.[name] + ')?' +
CASE WHEN c.system_type_id IN (35,104,167,175,231,239,241) THEN '""'
     WHEN c.system_type_id IN (58,61) THEN '(DT_DBTIMESTAMP)"1900-01-01"'
     ELSE '0' END
+ ':Dest_' + c.[name] + ')) || '
FROM sys.tables t
INNER JOIN sys.columns c
  ON t.[object_id] = c.[object_id]
WHERE SCHEMA_NAME( t.[schema_id] ) = 'dbo'
  AND t.[name] = 'DimUPRTable'
  AND c.[is_identity] = 0
  AND c.[is_rowguidcol] = 0
ORDER BY c.[column_id]

SET @Filter = LEFT(@Filter, (LEN(@Filter) - 2))

SELECT @Filter

--SELECT
-- c.*
--FROM
-- sys.tables t
--JOIN
-- sys.columns c
-- ON t.[object_id] = c.[object_id]
--WHERE
-- SCHEMA_NAME( t.[schema_id] ) = 'dbo'
--AND t.[name] = 'DimUPRTable'
--AND c.[is_identity] = 0
--AND c.[is_rowguidcol] = 0
--ORDER BY
--c.[column_id]

--SELECT
-- schemas.name AS [Schema]
-- ,tables.name AS [Table]
-- ,columns.name AS [Column]
-- ,CASE WHEN columns.system_type_id = 34 THEN 'byte[]'
-- WHEN columns.system_type_id = 35 THEN 'string'
-- WHEN columns.system_type_id = 36 THEN 'System.Guid'
-- WHEN columns.system_type_id = 48 THEN 'byte'
-- WHEN columns.system_type_id = 52 THEN 'short'
-- WHEN columns.system_type_id = 56 THEN 'int'
-- WHEN columns.system_type_id = 58 THEN 'System.DateTime'
-- WHEN columns.system_type_id = 59 THEN 'float'
-- WHEN columns.system_type_id = 60 THEN 'decimal'
-- WHEN columns.system_type_id = 61 THEN 'System.DateTime'
-- WHEN columns.system_type_id = 62 THEN 'double'
-- WHEN columns.system_type_id = 98 THEN 'object'
-- WHEN columns.system_type_id = 99 THEN 'string'
-- WHEN columns.system_type_id = 104 THEN 'bool'
-- WHEN columns.system_type_id = 106 THEN 'decimal'
-- WHEN columns.system_type_id = 108 THEN 'decimal'
-- WHEN columns.system_type_id = 122 THEN 'decimal'
-- WHEN columns.system_type_id = 127 THEN 'long'
-- WHEN columns.system_type_id = 165 THEN 'byte[]'
-- WHEN columns.system_type_id = 167 THEN 'string'
-- WHEN columns.system_type_id = 173 THEN 'byte[]'
-- WHEN columns.system_type_id = 175 THEN 'string'
-- WHEN columns.system_type_id = 189 THEN 'long'
-- WHEN columns.system_type_id = 231 THEN 'string'
-- WHEN columns.system_type_id = 239 THEN 'string'
-- WHEN columns.system_type_id = 241 THEN 'string'
-- END AS [Type]
-- ,columns.is_nullable AS [Nullable]
--FROM
-- sys.tables tables
--INNER JOIN
-- sys.schemas schemas
--ON (tables.schema_id = schemas.schema_id)
--INNER JOIN
-- sys.columns columns
--ON (columns.object_id = tables.object_id)
--WHERE
-- tables.name <> 'sysdiagrams'
-- AND tables.name <> 'dtproperties'
--ORDER BY
-- [Schema]
-- ,[Table]
-- ,[Column]
-- ,[Type]

July 28, 2010 8:26 AM


Paul Klotka said:

Using T-SQL to do change detection.

I would not use a join to detect change because in the where clause you need to
handle NULL values. For example if ColA in Source is NULL it doesn't matter
what ColA is in the destination, the where clause will return false and not
detect the change.

To get around this I use a union to detect change. Here is an example.

select ColId, ColA, ColB, ColC from Source

union

select ColId, ColA, ColB, ColC from Dest

This returns a distinct set of rows, including handling NULL values. All that is
left is to determine if the ColId appears more than once in the set.

select ColId from (

select ColId, ColA, ColB, ColC from Source

union

select ColId, ColA, ColB, ColC from Dest

)x

group by ColId

having count(*) > 1

Now I have a list of keys which changed. I can take this list and sort it to use in
a merge join in SSIS or I can use it as a subquery to join back to the Source
table. See below.

select ColId, ColA, ColB, ColC from Source s

inner join (

select ColId from (

select ColId, ColA, ColB, ColC from Source


union

select ColId, ColA, ColB, ColC from Dest

)x

group by ColId

having count(*) > 1

)y

on s.ColId = y.ColId

July 28, 2010 2:06 PM


Chhavi said:

Thanks for the good explanation and screenshots. I found this website to be extremely helpful and supportive.

Please let me know if I can learn something more from you and the rest of the guys visiting this website, so that we can become better at SSIS and SQL Server 2005 or 2008.

Please provide us with similar articles so that we can go through them and practice.

Thanks again Andy.

Long Live Andy :)

August 18, 2010 3:59 PM

About andyleonard
Andy Leonard is an Architect with Molina Medicaid Solutions, SQL Server database
and Integration Services developer, SQL Server MVP, PASS Regional Mentor
(Southeast US), and engineer. He is a co-author of Professional SQL Server 2005
Integration Services and SQL Server MVP Deep Dives.

©2006-2010 SQLblog.com TM
