
Module 4

Designing an ETL Solution


Module Overview

• Overview of ETL
• Planning Data Extraction
• Planning Data Transformation
• Planning Data Loads
Lesson 1: Overview of ETL

• ETL in a BI Project
• Common ETL Data Flow Architectures
• Documenting High-Level Data Flows
• Creating Source To Target Mappings
ETL in a BI Project

(Diagram: BI project phases)
• Business Requirements
• Technical Architecture and Infrastructure Design
• Data Warehouse and ETL Design
• Reporting and Analysis Design
• Monitoring and Optimizing
• Operations and Maintenance


Common ETL Data Flow Architectures

• Single-stage ETL (Source → DW)
  • Data is transferred directly from source to data warehouse
  • Transformations and validations occur in-flight or on extraction
• Two-stage ETL (Source → Staging → DW)
  • Data is staged for a coordinated load
  • Transformations and validations occur in-flight, or on staged data
• Three-stage ETL (Source → Landing Zone → Staging → DW)
  • Data is extracted quickly to a landing zone, and then staged prior to loading
  • Transformations and validation can occur throughout the data flow
Documenting High-Level Data Flows

(Example data flow: ProductDB → DimProduct)
• Sources: Product, Subcategory, and Category tables in ProductDB
• Audit Start
• Filter on LastModified
• Lookup Subcategory, then Lookup Category; Handle NULLs*
• Concatenate Size (Size + ' ' + MeasureUnit)
• Update SCD1 rows (ProductName)
• Update and insert SCD2 rows (Category, Subcategory, Size, Color; generate surrogate key)
• Insert new rows (generate surrogate key)
• Audit End
• Destination: DimProduct

*NULL Handling Rules
• Change NULL Subcategory and Category to "Uncategorized"
• Redirect rows with NULL ProductName
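The NULL handling rules above could be applied to staged data with Transact-SQL similar to the following sketch; the stg.Product and etl.InvalidProductRows object names are assumptions, not part of the documented design:

```sql
-- Replace NULL category values with the default "Uncategorized" member
-- (assumed staging table name).
UPDATE stg.Product
SET Subcategory = ISNULL(Subcategory, N'Uncategorized'),
    Category    = ISNULL(Category, N'Uncategorized')
WHERE Subcategory IS NULL OR Category IS NULL;

-- Redirect rows with a NULL ProductName to an error table for later review
-- (the error table must have a matching column layout).
DELETE FROM stg.Product
OUTPUT deleted.* INTO etl.InvalidProductRows
WHERE ProductName IS NULL;
```

In an SSIS data flow, the same rules would typically be implemented with a Derived Column transformation and a Conditional Split redirecting the invalid rows.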
Creating Source To Target Mappings
Source
  Data Source   Table      Column         Data Type
  ProductDB     Product    ProductID      int
  ProductDB     Product    Name           nvarchar
  ProductDB     Product    CategoryID     int
  ProductDB     Category   CategoryID     int
  ProductDB     Category   CategoryName   nvarchar

Landing Zone
  Table      Column     Data Type   Validation   Transformation
  Product    AltKey     int         not null     Rename column
  Product    Name       nvarchar
  Product    Category   nvarchar                 Lookup in Category

Staging
  Table        Column     Data Type   Validation   Transformation
  DimProduct   AltKey     int         not null
  DimProduct   Name       nvarchar    not null
  DimProduct   Category   nvarchar

Data Warehouse
  Table        Column     Data Type   Validation   Transformation
  DimProduct   PrdKey     int         not null     identity
  DimProduct   AltKey     int         not null
  DimProduct   Name       nvarchar    not null
  DimProduct   Category   nvarchar
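The source-to-landing-zone mappings above could be realized by an extraction query similar to the following sketch; the database and schema names are assumptions:

```sql
-- Extract from source into the landing zone, applying the mapped
-- transformations (sketch; ProductDB.dbo names are assumptions).
SELECT p.ProductID AS AltKey,        -- "Rename column" transformation
       p.Name,
       c.CategoryName AS Category    -- "Lookup in Category" transformation
FROM ProductDB.dbo.Product AS p
LEFT OUTER JOIN ProductDB.dbo.Category AS c
    ON p.CategoryID = c.CategoryID;
```

The LEFT OUTER JOIN preserves products with no matching category, so the NULL handling rules defined for the data flow can be applied downstream.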
Lesson 2: Planning Data Extraction

• Profiling Source Systems
• Identifying New and Modified Rows
• Planning Extraction Windows
Profiling Source Systems

• What data sources are there, and how will the ETL solution connect to them?
• What data types and formats are used in each source system?
• What data integrity and validation issues exist in the source data?
Identifying New and Modified Rows

• Data modification time fields
• Database modification tracking functionality
• Custom change detection functionality
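The first technique in the list above, a data modification time field, can be sketched in Transact-SQL as follows; the etl.ExtractionLog table and the LastModified column are assumptions for illustration:

```sql
-- Incremental extraction based on a data modification time field
-- (sketch; etl.ExtractionLog and LastModified are assumed names).
DECLARE @LastExtracted datetime2 =
    (SELECT MAX(ExtractionEnd)
     FROM etl.ExtractionLog
     WHERE TableName = N'Product');

-- Only rows changed since the previous successful extraction are returned.
SELECT ProductID, Name, CategoryID, LastModified
FROM dbo.Product
WHERE LastModified > @LastExtracted;
```

Database modification tracking functionality, such as SQL Server Change Tracking or Change Data Capture, can replace the hand-maintained high-water mark when the source system supports it.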
Planning Extraction Windows

• How frequently is new data generated in the source systems, and for how long is it retained?
• What latency between changes in the source systems and reporting is tolerable?
• How long does data extraction take?
• During what time periods are source systems least heavily used?
Lesson 3: Planning Data Transformation

• Where to Perform Transformations
• Transact-SQL vs. Data Flow Transformations
• Handling Invalid Rows and Errors
• Logging Audit Information
Where to Perform Transformations

(Diagram: Source → Landing Zone → Staging → Data Warehouse)
• On extraction
  • From source
  • From landing zone
  • From staging
• In data flow
  • Source to landing zone
  • Landing zone to staging
  • Staging to data warehouse
• In-place
  • In landing zone
  • In staging
Transact-SQL vs. Data Flow Transformations

• Use Transact-SQL
SELECT CAST(c.CustomerID AS nvarchar(5)) AS CustomerAltKey,
CONVERT(nvarchar(50), c.FirstName + ' ' + c.LastName) AS CustomerName,
ISNULL(m.MembershipLevelName, 'Unknown') AS MembershipLevel
FROM src.Customers AS c
LEFT OUTER JOIN src.MembershipLevels AS m
ON c.MembershipLevel = m.MembershipLevelID;

• Use transformations in SQL Server Integration Services


Handling Invalid Rows and Errors

• Redirect invalid rows in the data flow
• Use the Conditional Split transformation to validate
column values
• Use the No Match Output of a Lookup transformation
to redirect rows for which there is no matching value
• Use the Error Output to redirect rows that cause data
flow errors
• Handling errors
• SSIS Catalog
• Custom solution
Logging Audit Information

• Use SSIS logging to record package execution events
• Consider using an audit dimension to track
inserts and updates in data warehouse tables
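The audit dimension approach mentioned above could be sketched as follows; the dw.DimAudit table and its columns are assumptions for illustration:

```sql
-- Audit dimension pattern (sketch; dw.DimAudit is an assumed table).
INSERT INTO dw.DimAudit (PackageName, ExecutionStart)
VALUES (N'LoadDimProduct', SYSDATETIME());

DECLARE @AuditKey int = SCOPE_IDENTITY();

-- The data flow stamps @AuditKey on every row it inserts or updates, so
-- each warehouse row can be traced back to the load that produced it.

-- When the load completes, close out the audit record.
UPDATE dw.DimAudit
SET ExecutionEnd = SYSDATETIME()
WHERE AuditKey = @AuditKey;
```

This corresponds to the Audit Start and Audit End steps shown in the high-level data flow earlier in the module.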
Lesson 4: Planning Data Loads

• Minimizing Logging
• Loading Indexed Tables
• Loading Partitioned Fact Tables
Minimizing Logging

• Set the data warehouse recovery mode to simple or bulk-logged
• Consider enabling trace flag 610
• Use a bulk load operation to insert data:
• An SSIS data flow destination with Fast Load option
• The bulk copy program (BCP)
• The BULK INSERT statement
• The INSERT … SELECT statement
• The SELECT INTO statement
• The MERGE statement
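For example, a minimally logged INSERT … SELECT load from staging might look like the following sketch; the database and table names are assumptions:

```sql
-- Minimally logged INSERT...SELECT load (sketch; object names are assumptions).
ALTER DATABASE DW SET RECOVERY BULK_LOGGED;

-- TABLOCK enables minimal logging when inserting into a heap
-- (or, with trace flag 610, into an empty clustered-index range).
INSERT INTO dw.FactResellerSales WITH (TABLOCK)
SELECT *
FROM stg.ResellerSales;
```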
Loading Indexed Tables

• Consider dropping and recreating indexes for large volumes of new data
• Sort data by the clustering key and specify the
ORDER hint
• Columnstore indexes make the table read-only
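The drop-load-recreate pattern with a sorted bulk load can be sketched as follows; the index, file, and column names are assumptions:

```sql
-- Drop nonclustered indexes before a large load
-- (sketch; names are assumptions).
DROP INDEX IX_FactSales_ProductKey ON dw.FactSales;

-- Bulk load data pre-sorted by the clustering key, with the ORDER hint
-- so the engine can skip the sort.
BULK INSERT dw.FactSales
FROM 'D:\Staging\FactSales.dat'
WITH (ORDER (SalesOrderNumber ASC), TABLOCK);

-- Re-create the nonclustered index after the load completes.
CREATE INDEX IX_FactSales_ProductKey ON dw.FactSales (ProductKey);
```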
Loading Partitioned Fact Tables

• Switch loaded tables into partitions
• Use the same filegroup for the load table and the
partition
• Use a check constraint on the load table to ensure that
the data falls into the correct range (typically a date
range)
• Partition-align indexed views to avoid dropping
and recreating them when you perform data
loads
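The partition-switch technique described above can be sketched in Transact-SQL as follows; the table, constraint, partition number, and date-key range are assumptions for illustration:

```sql
-- Constrain the load table to the target partition's range so the
-- switch is allowed (sketch; names and values are assumptions).
ALTER TABLE stg.FactSalesLoad
ADD CONSTRAINT CK_FactSalesLoad_Range
CHECK (ShipDateKey >= 20140101 AND ShipDateKey < 20140201);

-- After the bulk load completes, switch the load table into the empty
-- target partition; this is a metadata-only operation.
ALTER TABLE stg.FactSalesLoad
SWITCH TO dw.FactResellerSales PARTITION 2;
```

The load table must reside on the same filegroup as the target partition and have matching schema and indexes for the switch to succeed.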
Lab: Designing an ETL Solution

• Exercise 1: Preparing for ETL Design
• Exercise 2: Creating Source to Target Documentation
• Exercise 3: Using SSIS To Load a Partitioned Fact Table

Logon Information
Start 20467D-MIA-DC and 20467D-MIA-SQL, and then log on to 20467D-MIA-SQL as ADVENTUREWORKS\Student with the password Pa$$w0rd.

Estimated Time: 120 Minutes


Lab Scenario

You have designed a data warehouse for Adventure Works Cycles and must now design the ETL processes that will
load data from source systems into the data warehouse.
You have decided to focus your design initially on the
Reseller Sales and Internet Sales dimensional models in the
data warehouse, so you can ignore the financial accounts
and marketing campaigns fact tables and their related
dimension tables. The source data is in a number of
sources, and you must examine each one to determine the
columns and data types and discover any data validation
or quality issues. Then you must design the ETL data flows
for the tables involved in the Reseller Sales and Internet
Sales dimensional models. Finally, you must design SSIS
packages to load data into the partitioned fact tables.
Lab Review

• Compare the source-to-target documentation in the D:\Labfiles\Lab04\Solution folder with your own documentation. What significant differences are there in the suggested solutions compared to your own, and how would you justify your own solutions?
• How might your design of the SSIS package that
loads the FactResellerSales table have differed if
the table was partitioned on OrderDateKey
instead of ShipDateKey?
Module Review and Takeaways

• Review Question(s)
