
ETL Concepts

What is ETL?
Need for ETL
ETL Glossary
The ETL Process
Data Extraction and Preparation
Data Cleansing
Data Transformation
Data Load
Data Refresh Strategies
ETL Solution Options
Characteristics of ETL Tools

Scope of the Training

ETL stands for Extraction, Transformation and Load. It is the most challenging, costly and time-consuming step in building any type of data warehouse. This step usually determines the success or failure of a data warehouse, because any analysis depends on the data and on the quality of the data being analyzed.

What is ETL?


Extraction
  The process of culling out the data required for the Data Warehouse from the source system
  Can be to a file or to a database
  Could involve some degree of cleansing or transformation
  Can be automated, since it becomes repetitive once established

What is ETL? - Extraction

What is ETL? - Transformation & Cleansing


Transformation
  Modification or transformation of data being imported into the Data Warehouse
  Usually done with the purpose of ensuring clean and consistent data
Cleansing
  The process of removing errors and inconsistencies from data being imported to a data warehouse
  Could involve multiple stages

What is ETL? - Loading


After extracting, scrubbing, cleaning, validating etc., the data needs to be loaded into the warehouse
Issues
  huge volumes of data to be loaded
  small time window available when the warehouse can be taken offline (usually nights)
  when to build index and summary tables

What is ETL? - Loading Techniques


Techniques
  batch load utility: sort input records on clustering key and use sequential I/O; build indexes and derived tables
  sequential loads still too long
  use parallelism and incremental techniques
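The batch-load technique above can be sketched as follows. This is a minimal illustration that assumes an in-memory SQLite database stands in for the warehouse; the table `fact`, the column `cluster_key` and the sample records are hypothetical. It sorts the input on the clustering key, loads it in one transaction, and builds the index only after the load.

```python
import sqlite3

# Hypothetical input records: (cluster_key, payload), arriving unsorted.
records = [(3, "C"), (1, "A"), (2, "B")]

dw = sqlite3.connect(":memory:")  # stand-in for the warehouse database
dw.execute("CREATE TABLE fact (cluster_key INTEGER, payload TEXT)")

# Sort on the clustering key so rows are written in key order.
records.sort(key=lambda r: r[0])

# Load the whole batch in a single transaction.
with dw:
    dw.executemany("INSERT INTO fact VALUES (?, ?)", records)

# Build the index after the load rather than maintaining it row by row.
dw.execute("CREATE INDEX idx_fact_key ON fact (cluster_key)")

loaded = dw.execute("SELECT cluster_key FROM fact ORDER BY rowid").fetchall()
```

Parallel and incremental loading, which the slide also names, are not shown here.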

Facilitates integration of data from various data sources for building a data warehouse
Note: Mergers and acquisitions also create disparities in data representation and pose more difficult challenges in ETL.
Businesses have data in multiple databases with different codification and formats
Transformation is required to convert and to summarize operational data into a consistent, business-oriented format
Pre-Computation of any derived

The Need for ETL

The Need for ETL - Example


Data Warehouse
Application   Gender code     Pipeline unit   Balance field
appl A        m, f            cm              balance
appl B        1, 0            in              bal
appl C        x, y            feet            currbal
appl D        male, female    yds             balcurr
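A transformation that reconciles the four applications above might look like the sketch below. The codes, units and field names come from the slide; the function `to_warehouse`, the target field names, and the pairing of 1/0 and x/y with male/female are assumptions made only for illustration.

```python
# Per-application gender codes mapped to one warehouse code.
# Assumption: codes are listed in the same order (male first) for every application.
GENDER_MAP = {
    "A": {"m": "M", "f": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"x": "M", "y": "F"},
    "D": {"male": "M", "female": "F"},
}

# Conversion factors to centimetres for each application's pipeline unit.
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds

# Name of the balance field in each application.
BALANCE_FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}


def to_warehouse(app, record):
    """Convert one source record into the consistent warehouse format."""
    return {
        "gender": GENDER_MAP[app][str(record["gender"]).lower()],
        "pipeline_cm": round(record["pipeline"] * TO_CM[app], 2),
        "balance": record[BALANCE_FIELD[app]],
    }


row_b = to_warehouse("B", {"gender": 1, "pipeline": 10, "bal": 250.0})
row_d = to_warehouse("D", {"gender": "Female", "pipeline": 2, "balcurr": 75.5})
```

Keeping the mappings in lookup tables rather than in code branches means a new source application only needs new table entries.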

Same person, different spellings: Agarwal, Agrawal, Aggarwal etc.
Multiple ways to denote company name: Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names: mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of

Data Integrity Problems - Scenarios

ETL Glossary
Extracting
Conditioning
House holding
Enrichment
Scoring

Extracting
  Capture of data from the operational source in "as is" status
  Sources for data are generally legacy mainframes in VSAM, IMS, IDMS, DB2; more data today is in relational databases on Unix
Conditioning
  The conversion of data types from the

House holding
  Identifying all members of a household (living at the same address)
  Ensures only one mail is sent to a household
  Can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakhs, so a 2% saving would save Rs. 1 lakh
Enrichment
  Bring data from external sources to augment/enrich operational data
  Data sources include Dun & Bradstreet, A. C.


Access data dictionaries defining source files

The ETL Process

Build logical and physical data models for target data


Identify sources of data from existing systems
Specify business and technical rules for data extraction, conversion and transformation
Perform data extraction and

Pull :- A Pull strategy is initiated by the target system. As part of the extraction process, the source data is pulled from the transactional system into a staging area by establishing a connection to the relational/flat/ODBC sources.
Advantage :- No additional space is required to store the data that needs to be loaded into the staging database
Disadvantage :- Burden on the transactional systems whenever data is loaded into the staging database

The ETL Process Push vs. Pull


With a PUSH strategy, the source system area maintains the application to read the source and create an interface file that is presented to your ETL. With a PULL strategy, the DW maintains the application to read the source.

The ETL Process - Data Extraction and Preparation


[Figure: three-stage flow. Stage I: Extract. Stage II: Analyze, Clean and Transform. Stage III: Data Movement and Load. Followed by Periodic Refresh/Update.]

The ETL Process - A simplified picture

[Figure: OLTP Systems -> Extract (Stage I) -> Staging Area, Transform (Stage II) -> Load (Stage III) -> Data Warehouse]

The ETL Process - Step 1

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract
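The difference between the two kinds of extract can be sketched as below. This is an illustration only: it assumes the source is an SQLite table named `orders` that carries an `updated_at` column, and both names are hypothetical. Real sources may need triggers or logs to identify changed rows.

```python
import sqlite3

src = sqlite3.connect(":memory:")  # stand-in for the operational source
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)")
src.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02")],
)


def static_extract(conn):
    """Snapshot of the chosen subset of the source at a point in time."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()


def incremental_extract(conn, since):
    """Only the rows changed after the previous extract."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (since,),
    ).fetchall()


snapshot = static_extract(src)
last_extract = max(row[2] for row in snapshot)  # high-water mark of the snapshot

# The source keeps changing after the snapshot was taken.
src.execute("UPDATE orders SET amount = 25.0, updated_at = '2024-01-03' WHERE id = 2")
src.execute("INSERT INTO orders VALUES (3, 30.0, '2024-01-04')")

changes = incremental_extract(src, last_extract)
```

Note that a timestamp column cannot reveal deleted rows; those need a separate mechanism.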

The ETL Process - Step 2

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

The ETL Process - Step 3

Transform = convert data from format of operational system to format of data warehouse

Record-level:
  Selection: data partitioning
  Joining: data combining
  Aggregation: data summarization

Field-level:
  Single-field: from one field to one field
  Multi-field: from many fields to one, or one field to many
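The record-level and field-level transformations listed above can be illustrated with plain Python. All data, field names and the `field_level` helper are hypothetical, chosen only to show one instance of each kind of transformation.

```python
from collections import defaultdict

# Hypothetical reference data and operational records.
regions = {1: "North", 2: "South"}
sales = [
    {"cust_id": 1, "first": "Asha", "last": "Rao", "amount": 100, "sold_on": "2024-01-15", "status": "ok"},
    {"cust_id": 2, "first": "Ravi", "last": "Iyer", "amount": 250, "sold_on": "2024-02-03", "status": "ok"},
    {"cust_id": 1, "first": "Asha", "last": "Rao", "amount": 40, "sold_on": "2024-02-20", "status": "void"},
    {"cust_id": 1, "first": "Asha", "last": "Rao", "amount": 60, "sold_on": "2024-03-01", "status": "ok"},
]

# Record-level: selection (keep only the rows of interest).
selected = [r for r in sales if r["status"] == "ok"]

# Record-level: joining (combine each row with its region).
joined = [{**r, "region": regions[r["cust_id"]]} for r in selected]

# Record-level: aggregation (summarize amount per region).
totals = defaultdict(int)
for r in joined:
    totals[r["region"]] += r["amount"]


def field_level(r):
    """Field-level transformations on a single record."""
    year, month, _day = r["sold_on"].split("-")
    return {
        "amount_paise": r["amount"] * 100,         # single-field: one field to one field
        "full_name": f'{r["first"]} {r["last"]}',  # multi-field: many fields to one
        "year": int(year),                         # multi-field: one field to many
        "month": int(month),
    }


shaped = [field_level(r) for r in joined]
```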

The ETL Process - Step 4

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to data warehouse
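The two load modes can be contrasted in a short sketch. It assumes an in-memory SQLite database as the target and a hypothetical `dim_customer` table; `INSERT OR REPLACE` is SQLite's way of writing a changed row and is used here only as a stand-in for whatever upsert the real warehouse offers.

```python
import sqlite3

dw = sqlite3.connect(":memory:")  # stand-in for the warehouse
dw.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, city TEXT)")


def refresh_load(conn, rows):
    """Refresh mode: bulk rewrite of the whole target table."""
    with conn:
        conn.execute("DELETE FROM dim_customer")
        conn.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)


def update_load(conn, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO dim_customer VALUES (?, ?)", changed_rows)


refresh_load(dw, [(1, "Pune"), (2, "Bombay")])   # periodic full rewrite
update_load(dw, [(2, "Mumbai"), (3, "Delhi")])   # later: one change, one new row

final_rows = dw.execute("SELECT id, city FROM dim_customer ORDER BY id").fetchall()
```

Refresh mode is simpler but rewrites every row each time; update mode touches fewer rows but depends on knowing reliably what changed.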

The ETL Process - Data Transformation


Transforms the data in accordance with the business rules and standards that have been established
Examples include: format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates

Sophisticated transformation tools are used for improving the quality of data
Clean data is vital for the success of the warehouse
Example: Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person
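One simple way to recognize the name variants above is approximate string matching. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the helper names and the 0.85 threshold are assumptions, and commercial cleansing tools use far more elaborate rules.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """True when two strings are close enough to count as the same spelling."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def refers_to(name, canonical_surname, threshold=0.85):
    """True when any token of the name resembles the canonical surname."""
    tokens = name.replace(".", " ").split()
    return any(similar(tok, canonical_surname, threshold) for tok in tokens)


variants = ["Seshadri", "Sheshadri", "Sesadri", "Seshadri S.", "Srinivasan Seshadri"]
matches = [v for v in variants if refers_to(v, "Seshadri")]
```

Matching on the surname alone is deliberately loose: it would also pair two different people who share a surname, so a real process would confirm matches with further fields such as address.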

Scrubbing/Cleansing Data

Reasons for Dirty data


Dummy Values
Absence of Data
Multipurpose Fields
Cryptic Data
Contradicting Data
Inappropriate Use of Address Lines
Violation of Business Rules
Reused Primary Keys
Non-Unique Identifiers
Data Integration Problems

The ETL Process - Data Cleansing


Source systems contain dirty data that must be cleansed
ETL software contains rudimentary data cleansing capabilities
Specialized data cleansing software is often used; it is important for performing name and address correction and house holding functions
Leading data cleansing/Quality Technology vendors include IBM

Steps in Data Cleansing


Parsing
Correcting
Standardizing
Matching
Consolidating

Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. Examples include parsing the first, middle, and last name; street number and street name; and city and state.

Correcting
Correcting fixes the parsed individual data components using sophisticated data algorithms and secondary data sources. Examples include replacing a vanity address and adding a zip code.

Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules. Examples include adding a pre name, replacing a nickname, and using a preferred street name.

Matching
Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications. Examples include identifying similar names and addresses.

Consolidating
Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
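The five cleansing steps above can be chained as in the following sketch. Everything in it is illustrative: the lookup tables, the record layout ("name, street, city") and the function names are assumptions, and matching is done by exact comparison of standardized fields rather than by the similarity rules a real tool would apply.

```python
# Hypothetical lookup tables standing in for business rules and secondary sources.
NICKNAMES = {"bob": "Robert"}
STREET_WORDS = {"st": "Street", "st.": "Street", "rd": "Road", "rd.": "Road"}
CITY_CORRECTIONS = {"bombay": "Mumbai"}


def parse(raw):
    """Parsing: isolate individual data elements from a 'name, street, city' string."""
    name, street, city = [part.strip() for part in raw.split(",")]
    first, last = name.split()
    number, *street_name = street.split()
    return {"first": first, "last": last, "number": number,
            "street": " ".join(street_name), "city": city}


def correct(rec):
    """Correcting: fix a component using a secondary source (here a city lookup)."""
    return {**rec, "city": CITY_CORRECTIONS.get(rec["city"].lower(), rec["city"])}


def standardize(rec):
    """Standardizing: replace nicknames and use preferred street words."""
    first = NICKNAMES.get(rec["first"].lower(), rec["first"].title())
    words = [STREET_WORDS.get(w.lower(), w.title()) for w in rec["street"].split()]
    return {**rec, "first": first, "last": rec["last"].title(), "street": " ".join(words)}


def match_key(rec):
    """Matching: records with the same key are treated as duplicates."""
    return tuple(rec[k].lower() for k in ("first", "last", "number", "street", "city"))


def consolidate(records):
    """Consolidating: merge matched records into one representation."""
    merged = {}
    for source_id, rec in records:
        entry = merged.setdefault(match_key(rec), {**rec, "sources": []})
        entry["sources"].append(source_id)
    return list(merged.values())


raw_rows = [
    (1, "Bob Smith, 12 Main St., Bombay"),
    (2, "Robert Smith, 12 Main Street, Mumbai"),
    (3, "Anita Desai, 4 Park Rd, Pune"),
]
cleaned = [(i, standardize(correct(parse(raw)))) for i, raw in raw_rows]
golden = consolidate(cleaned)
```

Rows 1 and 2 only become identical after correcting and standardizing, which is why matching runs last before consolidation.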

Data Quality Technology Tools (Vendors)


DataFlux Integration Server & dfPower Studio (www.DataFlux.com)
Trillium Software Discovery & Trillium Software System (www.trilliumsoftware.com)
ProfileStage & QualityStage (www.ascential.com)

MarketScope Update: Data Quality Technology ratings, 2005 (Source: Gartner - June 2005)

The ETL Process - Data Loading


Data are physically moved to the data warehouse
The loading takes place within a load window
The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications

First load is a complex exercise
Data extracted from tapes, files, archives etc.
First time load might take a lot of time to complete

Data Loading - First Time Load

Data Refresh
Issues: when to refresh?
  on every update: too expensive; only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
  periodically (e.g., every 24 hours, every week) or after significant events
  refresh policy set by administrator

Data refreshing can follow two approaches:
  Complete Data Refresh: completely refresh the target table every time
  Data Trickle Load: replicate only net changes and update the target database

Data Refresh

Snapshot approach - full extract from base tables
  read entire source table or database: expensive
  may be the only choice for legacy databases or files
Incremental techniques (related to work on active DBs)
  detect & propagate changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator), snapshots & triggers (Oracle)
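Trigger-based change detection, one of the incremental techniques named above, can be sketched with SQLite triggers. This is a stand-in for the idea, not the Oracle or replication-server mechanism; the tables `accounts` and `change_log` and the `propagate` function are hypothetical, and a dictionary plays the role of the warehouse.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the source database
db.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE change_log (seq INTEGER PRIMARY KEY AUTOINCREMENT, id INTEGER, op TEXT);
CREATE TRIGGER trg_ins AFTER INSERT ON accounts
BEGIN INSERT INTO change_log (id, op) VALUES (NEW.id, 'I'); END;
CREATE TRIGGER trg_upd AFTER UPDATE ON accounts
BEGIN INSERT INTO change_log (id, op) VALUES (NEW.id, 'U'); END;
CREATE TRIGGER trg_del AFTER DELETE ON accounts
BEGIN INSERT INTO change_log (id, op) VALUES (OLD.id, 'D'); END;
""")

# Ordinary source activity; the triggers record each change.
db.execute("INSERT INTO accounts VALUES (1, 100.0)")
db.execute("INSERT INTO accounts VALUES (2, 200.0)")
db.execute("UPDATE accounts SET balance = 150.0 WHERE id = 1")
db.execute("DELETE FROM accounts WHERE id = 2")

warehouse = {}  # id -> balance, standing in for the target table


def propagate(conn, target, after_seq=0):
    """Apply logged changes newer than after_seq; return the last seq applied."""
    last = after_seq
    log = conn.execute(
        "SELECT seq, id, op FROM change_log WHERE seq > ? ORDER BY seq", (after_seq,)
    ).fetchall()
    for seq, row_id, op in log:
        if op == "D":
            target.pop(row_id, None)
        else:
            row = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (row_id,)
            ).fetchone()
            if row is not None:  # row may have been deleted later in the log
                target[row_id] = row[0]
        last = seq
    return last


last_seq = propagate(db, warehouse)
```

Only the changed rows are read, in contrast to the snapshot approach, which reads the whole source table each time.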

Data Refresh Techniques

ETL Solution Options


ETL

Custom Solution

Generic Solution

Using RDBMS staging tables and stored procedures
Programming languages like C, C++, Perl, Visual Basic etc.
Building a code generator

Custom Solution

Custom Solution Typical components


Extract From Source
  Snapshots for dimension tables
  PL/SQL extraction procedure
  Complex views for transformation
  Control table and highly parameterized/generic extraction process
Data Quality
  Control table driven
  Highly configurable process
  PL/SQL procedure
  Checks performed - referential integrity, Y2K, elementary statistics, business rules
  Mechanism to flag the records as bad / reject
Generate Download Files
  Multiple Stars extracted as separate groups
  Pro*C programs using embedded SQL
  Surrogate key generation mechanism
  ASCII file downloads generated for load into warehouse

Control Program

Time window based extraction
Restart at point of failure
High level of error handling
Control metadata captured in Oracle tables
Facility to launch failure recovery programs automatically
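"Restart at point of failure" driven by control metadata could work roughly as in the sketch below. It assumes a hypothetical `etl_control` table held in SQLite rather than Oracle, and the step functions are empty placeholders with one simulated failure.

```python
import sqlite3

ctl = sqlite3.connect(":memory:")  # stand-in for the control metadata tables
ctl.execute("CREATE TABLE etl_control (step TEXT PRIMARY KEY, status TEXT)")


def run_pipeline(conn, steps):
    """Run steps in order, skipping those already done; stop at the first failure."""
    executed = []
    for name, func in steps:
        row = conn.execute(
            "SELECT status FROM etl_control WHERE step = ?", (name,)
        ).fetchone()
        if row and row[0] == "done":
            continue  # completed in an earlier run
        try:
            func()
        except Exception:
            with conn:
                conn.execute("INSERT OR REPLACE INTO etl_control VALUES (?, 'failed')", (name,))
            return executed, name
        with conn:
            conn.execute("INSERT OR REPLACE INTO etl_control VALUES (?, 'done')", (name,))
        executed.append(name)
    return executed, None


attempts = {"transform": 0}


def extract():
    pass  # placeholder step


def transform():
    attempts["transform"] += 1
    if attempts["transform"] == 1:
        raise RuntimeError("simulated failure on the first attempt")


def load():
    pass  # placeholder step


steps = [("extract", extract), ("transform", transform), ("load", load)]
first_run = run_pipeline(ctl, steps)   # stops at the failing step
second_run = run_pipeline(ctl, steps)  # resumes without repeating 'extract'
```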

Address limitations (in scalability & complexity) of manual coding
The need to deliver quantifiable business value
Functionality, reliability and viability are no longer major issues

Generic Solution

Provides facility to specify a large number of transformation rules with a GUI
Generate programs to transform data
Handle multiple data sources
Handle data redundancy
Generate metadata as output
Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
Support data extraction, cleansing, aggregation, reorganization, transformation, and load operations

Characteristics of ETL Tools

Types of ETL Tools

First-generation
  Code-generation products
  Generate the source code

Second-generation
  Engine-driven products
  Generate directly executable code

Note: Due to more efficient architecture, second generation tools have significant

Types of ETL Tools - First-Generation

Extraction, transformation, load process runs on server or host
GUI interface is used to define extraction/transformation processes
Detailed transformations require coding in COBOL or C
Extract program is generated automatically as source code
Source code is compiled, scheduled, and run in batch mode
Uses intermediate files

First-Generation ETL Tools - Strengths and Limitations

Strengths
  Tools are mature
  Programmers are familiar with code generation in COBOL or C

Limitations
  High cost of products
  Complex training
  Extract programs have to be compiled from source
  Many transformations have to be coded manually
  Lack of parallel execution support
  Most metadata has to be generated manually

First-Generation ETL Tools - Examples

SAS/Warehouse Administrator
Prism from Prism Solutions
Passport from Apertus Carleton Corp
ETI-EXTRACT Tool Suite from Evolutionary Technologies
Copy Manager from Information Builders

Types of ETL Tools - Second-Generation

Extraction/Transformation/Load runs on server
Data directly extracted from source and processed on server
Data transformation in memory and written directly to warehouse database
High throughput since intermediate files are not used
Directly executable code
Support for monitoring, scheduling, extraction, scrubbing, transformation,

Second-Generation ETL Tools - Strengths and Limitations

Strengths
  Lower cost suites, platforms, and environment
  Fast, efficient, and multi-threaded
  ETL functions highly

Limitations
  Not mature
  Initial tools oriented only to RDBMS sources

Second-Generation ETL Tools - Examples

PowerMart from Informatica
DataStage from Ardent
Data Mart Solution from Sagent Technology
Tapestry from D2K

ETL Tools - Examples


DataStage from Ascential Software
SAS System from SAS Institute
Power Mart/Power Center from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from Hummingbird Communications

ETL Tool - General Selection criteria


Business vision/considerations
Overall IT strategy/architecture
Overall cost of ownership
Vendor positioning in the market
Performance
In-house expertise available
User friendliness
Training requirements for existing users
References from other customers

Support to retrieve, cleanse, transform, summarize, aggregate, and load data
Engine-driven products for fast, parallel operation
Generate and manage central metadata repository
Open metadata exchange architecture
Provide end-users with access to metadata in business terms
Support development of logical and physical data models

ETL Tool Specific Selection criteria

ETL Tool - Selection criteria


[Figure: ETL tool ratings chart, rated from High Rating to Low Rating. Source: Gartner Report]
Criteria: Metadata Management and Administration; Data Extraction & Integration Complexity; Operations Management/Process Automation; Data Transformation and Repair Complexity; Target Database Loading; Ease of Use/Development Capabilities
Tools rated: ETI Extract, SAS Warehouse Administrator, Ardent Warehouse Executive, Carleton Pureview, Informatica PowerCenter, Platinum Decision Base, Ardent DataStage, Data Mirror Transformation Server