
ETL Transformation

 Transformation is the logic that creates, modifies, or
passes data to the defined target structures (tables, files, or
any other target).

 The purpose of the transformation is to modify the source
data to meet the requirements of the target system. It also
ensures the quality of the data being loaded into the target.

 Transformation is classified into two categories: one based
on connectivity, and the other based on the change in the
number of rows.
Connectivity based types

 Connected Transformations:
Within a mapping, transformations that are connected to other
transformations are called connected transformations.
 When to use: Connected transformations are preferred when the
transformation is called for, or is expected to return a value for, every
input row.
 For example, a transformation that returns the city name for the zip
code in every row.

 Unconnected Transformations:
Transformations that are not connected to any other transformation
are called unconnected transformations.
 When to use: Unconnected transformations are useful when their
functionality is required only periodically or under certain
conditions.
 For example, calculating the tax details only when the tax value is
not available (both invocation patterns are sketched below).
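A minimal Python sketch of the two invocation patterns (the lookup table, tax rate, and function names are hypothetical; in a real ETL tool such as Informatica these transformations are wired graphically inside a mapping):

    # Hypothetical sketch: a connected transformation runs for every row,
    # an unconnected one is invoked only when a condition requires it.
    ZIP_TO_CITY = {"10001": "New York", "60601": "Chicago"}   # toy lookup data
    DEFAULT_TAX_RATE = 0.07                                    # assumed rate

    def city_lookup(row):
        # "Connected": called once per input row, always returns a value.
        row["city"] = ZIP_TO_CITY.get(row["zip"], "UNKNOWN")
        return row

    def compute_tax(amount):
        # "Unconnected": called only when tax is missing in the source row.
        return round(amount * DEFAULT_TAX_RATE, 2)

    rows = [{"zip": "10001", "amount": 100.0, "tax": None},
            {"zip": "60601", "amount": 50.0, "tax": 4.0}]

    for row in rows:
        row = city_lookup(row)            # every row passes through
        if row["tax"] is None:            # conditional invocation
            row["tax"] = compute_tax(row["amount"])
        print(row)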
Transformation types based on the change in
number of rows:
 Active Transformations:
 Active transformations are those that modify the data rows and the
number of rows passed through them.
 If a transformation receives ten rows as input and returns fifteen
rows as output, it is an active transformation.
 The data in the rows may also be modified by an active transformation.

 Passive Transformations:
 Passive transformations are those that do not change the number
of input rows.
 In passive transformations the number of input and output rows
remains the same; only the data is modified at the row level.
 In a passive transformation, no new rows are created and no existing
rows are dropped (see the sketch below).
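A short Python sketch of the distinction, on hypothetical rows: a filter drops rows, so its output count can differ from its input count, while a row-level edit leaves the count unchanged.

    # Sketch: an active transformation may change the row count; a passive
    # transformation returns exactly one output row per input row.
    rows = [{"id": 1, "qty": 5}, {"id": 2, "qty": 0}, {"id": 3, "qty": 7}]

    def filter_nonzero(rows):
        # Active: drops rows, so output count can differ from input count.
        return [r for r in rows if r["qty"] > 0]

    def add_flag(rows):
        # Passive: modifies each row; the row count is unchanged.
        return [{**r, "bulk": r["qty"] > 5} for r in rows]

    print(len(filter_nonzero(rows)))  # 2 of 3 rows survive -> active
    print(len(add_flag(rows)))        # still 3 rows -> passive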
Filter Transformation
 The Filter transformation, which passes through only the rows that
satisfy a condition, is a classic example of an active transformation.
(5.6) Data Integration and the Extraction,
Transformation, and Load (ETL) Process
 Extraction, transformation, and load (ETL)
A data warehousing process that consists of:
 extraction (i.e., reading data from a database),
 transformation (i.e., converting the extracted data from its
previous form into the form it needs to take so that it can be
placed into a data warehouse or simply another database), and
 load (i.e., putting the data into the data warehouse)
 During the extraction process, the input files are written to
a set of staging tables to facilitate the load process (a minimal
end-to-end sketch follows).
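A minimal end-to-end sketch in Python, assuming an in-memory SQLite source and target with a toy schema:

    import sqlite3

    # Extract from a source table, transform, and load into a target table.
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE sales (id INTEGER, amount TEXT)")
    src.executemany("INSERT INTO sales VALUES (?, ?)", [(1, "10.5"), (2, "20")])

    tgt = sqlite3.connect(":memory:")
    tgt.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")

    # Extract: read rows from the source.
    rows = src.execute("SELECT id, amount FROM sales").fetchall()

    # Transform: convert the amount from text to a numeric type.
    clean = [(rid, float(amount)) for rid, amount in rows]

    # Load: write the transformed rows into the warehouse table.
    tgt.executemany("INSERT INTO fact_sales VALUES (?, ?)", clean)
    tgt.commit()
    print(tgt.execute("SELECT * FROM fact_sales").fetchall())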

 Issues that affect whether an organization will purchase data
transformation tools or build the transformation
process itself:
 Data transformation tools are expensive
 Data transformation tools may have a long learning curve
 It is difficult to measure how well the IT organization is
doing until it has learned to use the data
transformation tools

 Important criteria in selecting an ETL tool
 Ability to read from and write to an unlimited number
of data source architectures
 Automatic capturing and delivery of metadata
 A history of conforming to open standards
 An easy-to-use interface for the developer and the
functional user



Data Extraction
 Often performed by COBOL routines
(not recommended because of high program
maintenance and no automatically generated metadata)
 Sometimes source data is copied to the target
database using the replication capabilities of a
standard RDBMS (not recommended because of
“dirty data” in the source systems)
 Increasingly performed by specialized ETL software
Reasons for “Dirty” Data
 Dummy Values
 Absence of Data
 Multipurpose Fields
 Cryptic Data
 Contradicting Data
 Inappropriate Use of Address Lines
 Violation of Business Rules
 Reused Primary Keys
 Non-Unique Identifiers
 Data Integration Problems
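A few of these categories can be flagged with simple rule-based checks; a Python sketch with hypothetical rules and rows:

    # Sketch: checks for some "dirty data" categories above (dummy values,
    # absence of data, non-unique identifiers). The rules are assumed.
    DUMMY_VALUES = {"999-99-9999", "N/A", "XXXX"}

    rows = [{"id": 1, "ssn": "999-99-9999", "name": "Ann"},
            {"id": 1, "ssn": "123-45-6789", "name": None}]

    seen_ids = set()
    for row in rows:
        if row["ssn"] in DUMMY_VALUES:
            print(f"dummy value in row {row['id']}: ssn={row['ssn']}")
        if any(v is None for v in row.values()):
            print(f"missing data in row {row['id']}")
        if row["id"] in seen_ids:
            print(f"non-unique identifier: {row['id']}")
        seen_ids.add(row["id"])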
Data Cleansing
 Source systems contain “dirty data” that must be
cleansed
 ETL software contains rudimentary data cleansing
capabilities
 Specialized data cleansing software is often used; it is
important for performing name and address correction
and householding functions
 Leading data cleansing vendors include Vality
(Integrity), Harte-Hanks (Trillium), and Firstlogic
(i.d.Centric)
Steps in Data Cleansing
 Parsing
 Correcting
 Standardizing
 Matching
 Consolidating
Parsing
 Parsing locates and identifies individual data
elements in the source files and then isolates these
data elements in the target files.
 Examples include parsing the first, middle, and last
name; street number and street name; and city and
state.
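A toy Python sketch of parsing, assuming simple "first middle last" and "number street" layouts (real cleansing tools use much richer grammars):

    # Sketch: isolate individual data elements from free-form source fields.
    def parse_name(full_name):
        parts = full_name.split()
        first, last = parts[0], parts[-1]
        middle = " ".join(parts[1:-1]) or None
        return {"first": first, "middle": middle, "last": last}

    def parse_street(street_line):
        number, _, street = street_line.partition(" ")
        return {"street_number": number, "street_name": street}

    print(parse_name("John Q Public"))
    print(parse_street("123 Main Street"))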
Correcting
 Correcting fixes the parsed individual data components
using sophisticated data algorithms and secondary data
sources.
 Examples include replacing a vanity address and
adding a zip code.
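A minimal sketch of correcting, assuming a toy city-to-zip reference table as the secondary data source:

    # Sketch: correct a parsed component against reference data,
    # here filling in a missing zip code from an assumed lookup table.
    CITY_TO_ZIP = {("Springfield", "IL"): "62701"}   # assumed reference data

    def add_zip(record):
        if not record.get("zip"):
            record["zip"] = CITY_TO_ZIP.get((record["city"], record["state"]))
        return record

    print(add_zip({"city": "Springfield", "state": "IL", "zip": None}))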
Standardizing
 Standardizing applies conversion routines to
transform data into its preferred (and consistent)
format using both standard and custom business
rules.
 Examples include adding a prename (e.g., Mr. or Ms.),
replacing a nickname with the formal name, and using a
preferred street name.
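A minimal sketch of standardizing, with assumed nickname and street-type conversion rules:

    # Sketch: apply conversion routines so equivalent values share one
    # preferred, consistent format.
    NICKNAMES = {"Bob": "Robert", "Liz": "Elizabeth"}      # assumed rules
    STREET_TYPES = {"St.": "Street", "Ave.": "Avenue"}

    def standardize(record):
        record["first"] = NICKNAMES.get(record["first"], record["first"])
        for abbrev, full in STREET_TYPES.items():
            record["street"] = record["street"].replace(abbrev, full)
        return record

    print(standardize({"first": "Bob", "street": "123 Main St."}))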
Matching
 Matching searches for and matches records within and
across the parsed, corrected, and standardized data,
based on predefined business rules, to eliminate
duplications.
 Examples include identifying similar names and
addresses.
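A minimal sketch of matching using Python's standard difflib for fuzzy comparison; the 0.85 threshold stands in for a predefined business rule:

    from difflib import SequenceMatcher

    # Sketch: flag likely duplicates by fuzzy-comparing name + address.
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    records = ["Robert Smith, 123 Main Street",
               "Robert Smyth, 123 Main St",
               "Alice Jones, 9 Oak Avenue"]

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= 0.85:   # assumed business-rule threshold
                print(f"possible match ({score:.2f}): "
                      f"{records[i]} / {records[j]}")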
Consolidating
 Consolidating analyzes and identifies relationships between
matched records and consolidates/merges them
into ONE representation.
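A minimal sketch of consolidating, using an assumed survivorship rule (take the first non-empty value seen for each field):

    # Sketch: merge a group of matched records into one surviving record.
    matched = [{"name": "Robert Smith", "phone": None, "zip": "62701"},
               {"name": "Robert Smyth", "phone": "555-0100", "zip": None}]

    def consolidate(group):
        merged = {}
        for field in group[0]:
            # Survivorship rule (assumed): first non-empty value wins.
            merged[field] = next((r[field] for r in group if r[field]), None)
        return merged

    print(consolidate(matched))  # one representation of the duplicates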
Data Staging
 Often used as an interim step between data extraction and
later steps
 Accumulates data from asynchronous sources using native
interfaces, flat files, FTP sessions, or other processes
 At a predefined cutoff time, data in the staging file is
transformed and loaded to the warehouse
 There is usually no end user access to the staging file
 An operational data store may be used for data staging
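A minimal sketch of staging with an in-memory SQLite database (assumed schema): sources append to the staging table as data arrives, and at the cutoff the staged rows are transformed and moved to the warehouse table.

    import sqlite3, datetime

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE staging (id INTEGER, amount TEXT, arrived TEXT)")
    db.execute("CREATE TABLE warehouse (id INTEGER, amount REAL)")

    # Sources write to staging as data arrives (no end-user access here).
    now = datetime.datetime.now().isoformat()
    db.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                   [(1, "10.5", now), (2, "20", now)])

    # At the cutoff, transform staged rows and load them to the warehouse.
    staged = db.execute("SELECT id, amount FROM staging").fetchall()
    db.executemany("INSERT INTO warehouse VALUES (?, ?)",
                   [(rid, float(a)) for rid, a in staged])
    db.execute("DELETE FROM staging")  # staging is emptied after the load
    print(db.execute("SELECT * FROM warehouse").fetchall())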
Data Transformation
 Transforms the data in accordance with the business
rules and standards that have been established
 Examples include: format changes, deduplication,
splitting up fields, replacement of codes, derived
values, and aggregates
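A short Python sketch touching each of these example transformations on hypothetical rows:

    # Sketch: a format change, code replacement, a derived value,
    # deduplication, and an aggregate, applied to toy input rows.
    CODES = {"M": "Male", "F": "Female"}            # code replacement table

    raw = [{"id": 1, "gender": "M", "qty": 2, "price": "9.99"},
           {"id": 1, "gender": "M", "qty": 2, "price": "9.99"},  # duplicate
           {"id": 2, "gender": "F", "qty": 1, "price": "4.50"}]

    seen, clean = set(), []
    for row in raw:
        if row["id"] in seen:
            continue                              # deduplication (by id)
        seen.add(row["id"])
        row["price"] = float(row["price"])        # format change
        row["gender"] = CODES[row["gender"]]      # replacement of codes
        row["total"] = row["qty"] * row["price"]  # derived value
        clean.append(row)

    print(sum(r["total"] for r in clean))         # aggregate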
Data Loading
 Data are physically moved to the data warehouse
 The loading takes place within a “load window”
 The trend is toward near-real-time updates of the data
warehouse, as the warehouse is increasingly used for
operational applications
Meta Data
 Data about data
 Needed by both information technology personnel
and users
 IT personnel need to know data sources and
targets; database, table and column names; refresh
schedules; data usage measures; etc.
 Users need to know entity/attribute definitions;
reports/query tools available; report distribution
information; help desk contact information; etc.
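A sketch of what one metadata record might carry for both audiences (all field names and values are hypothetical):

    # Sketch: one metadata record serving both IT personnel and end users.
    metadata = {
        # What IT personnel need:
        "source": "orders_db.public.orders",
        "target": "dw.fact_orders",
        "refresh_schedule": "daily 02:00",
        "row_count_last_load": 120_000,
        # What end users need:
        "definition": "One row per confirmed customer order",
        "available_reports": ["Monthly Sales", "Order Aging"],
        "help_desk": "dw-support@example.com",   # hypothetical contact
    }
    print(metadata["definition"])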
