Connected Transformations:
Transformations that are connected to other transformations in a
mapping are called connected transformations.
When to use: Connected transformations are preferred when the
transformation is called for every input row or is expected to return a
value.
For example, a transformation that returns the city name for the zip
code in every row.
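The connected pattern can be sketched in plain Python. This is only an illustration, not Informatica syntax: the lookup table `ZIP_TO_CITY`, the function `lookup_city`, and the sample rows are all hypothetical. The point is that the step sits in the row pipeline and runs once for every row.

```python
# Hypothetical lookup table; in practice this would come from a reference source.
ZIP_TO_CITY = {"10001": "New York", "60601": "Chicago"}

def lookup_city(row):
    """Connected-style step: runs for every row and always returns a row."""
    return {**row, "city": ZIP_TO_CITY.get(row["zip"])}

rows = [
    {"id": 1, "zip": "10001"},
    {"id": 2, "zip": "60601"},
    {"id": 3, "zip": "99999"},  # unknown zip code: city becomes None
]

# Every input row passes through the lookup, so the output has one row per input row.
enriched = [lookup_city(r) for r in rows]
```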
Unconnected Transformations:
Transformations that are not connected to any other transformations
are called unconnected transformations.
When to use: The unconnected transformations are useful when
their functionality is only required periodically or based upon certain
conditions.
For example, calculating the tax details only when the tax value is not
available.
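A rough Python analogy for the unconnected pattern, assuming a hypothetical flat `TAX_RATE` and hypothetical helper names: the tax calculation is not wired into the row flow and is called only when the condition (a missing tax value) holds.

```python
TAX_RATE = 0.10  # hypothetical flat rate, for illustration only

def compute_tax(amount):
    """Unconnected-style helper: not part of the row pipeline, called on demand."""
    return round(amount * TAX_RATE, 2)

def fill_missing_tax(row):
    # Call the helper only when the tax value is not available.
    if row.get("tax") is None:
        return {**row, "tax": compute_tax(row["amount"])}
    return row

orders = [
    {"amount": 200.0, "tax": 18.5},   # tax present: helper is not called
    {"amount": 50.0, "tax": None},    # tax missing: helper is called
]
completed = [fill_missing_tax(o) for o in orders]
```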
Transformation types based on the change in the
number of rows:
Active Transformations:
Active transformations are those that can change the number of rows
passing through them.
If a transformation receives ten rows as input and returns fifteen rows
as output, it is an active transformation.
The data in the rows may also be modified in an active transformation.
Passive Transformations:
Passive transformations are those that do not change the number
of input rows.
In passive transformations the number of input and output rows
remains the same; only data is modified at row level.
In a passive transformation, no new rows are created and no existing
rows are dropped.
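The difference between the two types can be illustrated with a small Python sketch. The function names and sample rows are hypothetical; the only thing being shown is whether the row count can change.

```python
rows = [
    {"name": "ann", "qty": 0},
    {"name": "bob", "qty": 3},
    {"name": "cy", "qty": 7},
]

def passive_upper(rows):
    """Passive: one output row per input row; only values change."""
    return [{**r, "name": r["name"].upper()} for r in rows]

def active_filter(rows):
    """Active: may drop rows, so the row count can shrink."""
    return [r for r in rows if r["qty"] > 0]

def active_expand(rows):
    """Active: may emit several rows per input row (one per unit here)."""
    return [
        {"name": r["name"], "unit": i + 1}
        for r in rows
        for i in range(r["qty"])
    ]
```

Here `passive_upper` always returns as many rows as it receives, while `active_filter` and `active_expand` can return fewer or more.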
Filter Transformation
(5.6) Data Integration and the Extraction,
Transformation, and Load (ETL) Process
Extraction, transformation, and load (ETL)
A data warehousing process that consists of:
extraction (i.e., reading data from a database),
transformation (i.e., converting the extracted data from its
previous form into the form in which it needs to be so that it can
be placed into a data warehouse or simply another database),
and
load (i.e., putting the data into the data warehouse)
During the extraction process, the input files are written to
a set of staging tables to facilitate the load process.
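The three steps and the staging table can be sketched end to end with Python's standard `csv` and `sqlite3` modules. The table names, the column layout, and the sample data are assumptions made for this illustration; a real warehouse load would use its own schema and tooling.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real extract would read from a file or database.
SOURCE_CSV = "id,amount\n1, 10.50\n2,3.25\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (id TEXT, amount TEXT)")
conn.execute(
    "CREATE TABLE warehouse_sales (id INTEGER PRIMARY KEY, amount_cents INTEGER)"
)

# Extract: copy the raw input into a staging table without changing it.
reader = csv.DictReader(io.StringIO(SOURCE_CSV))
conn.executemany(
    "INSERT INTO staging_sales VALUES (?, ?)",
    [(r["id"], r["amount"]) for r in reader],
)

# Transform: convert the staged text values into the form the warehouse expects.
staged = conn.execute("SELECT id, amount FROM staging_sales").fetchall()
transformed = [(int(i), round(float(a.strip()) * 100)) for i, a in staged]

# Load: write the transformed rows into the warehouse table.
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?)", transformed)
conn.commit()

loaded = conn.execute(
    "SELECT id, amount_cents FROM warehouse_sales ORDER BY id"
).fetchall()
```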
Parsing
Correcting
Standardizing
Matching
Consolidating
Parsing
Parsing locates and identifies individual data
elements in the source files and then isolates these
data elements in the target files.
Examples include parsing the first, middle, and last
name; street number and street name; and city and
state.
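A minimal parsing sketch in Python, assuming simple space-separated names and addresses that start with a street number. The helper names are hypothetical, and real-world data would need far more rules than this.

```python
def parse_name(full_name):
    """Split 'First [Middle ...] Last' into separate fields."""
    parts = full_name.split()
    if not parts:
        return {"first": "", "middle": "", "last": ""}
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),
        "last": parts[-1] if len(parts) > 1 else "",
    }

def parse_street(address):
    """Split '123 Main Street' into street number and street name."""
    number, _, name = address.partition(" ")
    if not number.isdigit():
        # No leading number: keep the whole value as the street name.
        return {"street_number": "", "street_name": address}
    return {"street_number": number, "street_name": name}
```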
Correcting
Corrects parsed individual data components using
sophisticated data algorithms and secondary data
sources.
Examples include replacing a vanity address and
adding a zip code.
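As a simplified illustration of correcting from a secondary data source, the sketch below fills in a missing zip code. `ZIP_DIRECTORY` is a hypothetical stand-in for such a source; real directories map addresses, not just a city and state, to zip codes.

```python
# Hypothetical secondary data source mapping (city, state) to a zip code.
ZIP_DIRECTORY = {("Springfield", "IL"): "62701"}

def add_missing_zip(record):
    """Fill in the zip code from the secondary source when it is absent."""
    if record.get("zip"):
        return record  # already has a zip code: leave it unchanged
    found = ZIP_DIRECTORY.get((record["city"], record["state"]))
    return {**record, "zip": found} if found else record
```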
Standardizing
Standardizing applies conversion routines to
transform data into its preferred (and consistent)
format using both standard and custom business
rules.
Examples include adding a name prefix (such as a title), replacing a
nickname, and using a preferred street name.
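Standardizing rules of this kind can be sketched as small lookup tables. The entries in `NICKNAMES` and `STREET_SUFFIXES` are hypothetical business rules chosen for illustration only.

```python
# Hypothetical business rules; a real rule set would be much larger.
NICKNAMES = {"bill": "William", "bob": "Robert"}
STREET_SUFFIXES = {"st": "Street", "st.": "Street", "ave": "Avenue", "ave.": "Avenue"}

def standardize_first_name(name):
    """Replace a known nickname; otherwise just normalize spacing and case."""
    cleaned = name.strip()
    return NICKNAMES.get(cleaned.lower(), cleaned.title())

def standardize_street(street):
    """Expand an abbreviated street suffix to its preferred form."""
    words = street.split()
    if words and words[-1].lower() in STREET_SUFFIXES:
        words[-1] = STREET_SUFFIXES[words[-1].lower()]
    return " ".join(words)
```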
Matching
Matching searches for and matches records within and across
the parsed, corrected and standardized data based
on predefined business rules to eliminate
duplications.
Examples include identifying similar names and
addresses.
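One simple way to identify similar names and addresses is string similarity. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.85 threshold and the rule that both fields must be similar are assumptions for illustration, not a rule taken from the text.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_matches(records, threshold=0.85):
    """Return index pairs whose name and address are both similar enough."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if (
            similarity(a["name"], b["name"]) >= threshold
            and similarity(a["address"], b["address"]) >= threshold
        ):
            pairs.append((i, j))
    return pairs

records = [
    {"name": "Jon Smith", "address": "12 Main Street"},
    {"name": "John Smith", "address": "12 Main Street"},
    {"name": "Mary Jones", "address": "9 Oak Avenue"},
]
matches = find_matches(records)
```

Comparing every pair is quadratic in the number of records, so larger data sets usually group records by some key before comparing.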
Consolidating
Consolidating analyzes and identifies relationships between
matched records and merges them
into one representation.
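A minimal consolidation sketch, assuming the records have already been matched and that the merge rule is "keep the first non-empty value for each field". That rule and the sample data are illustrative assumptions; real survivorship rules are usually more elaborate.

```python
def consolidate(records):
    """Merge matched records into one, keeping the first non-empty value per field."""
    merged = {}
    for record in records:
        for field, value in record.items():
            # Fill the field if it is still missing or empty.
            if not merged.get(field):
                merged[field] = value
    return merged

matched = [
    {"name": "John Smith", "phone": "", "email": "john@example.com"},
    {"name": "John Smith", "phone": "555-0100", "email": ""},
]
golden = consolidate(matched)
```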
Data Staging
Often used as an interim step between data extraction and
later steps
Accumulates data from asynchronous sources using native
interfaces, flat files, FTP sessions, or other processes
At a predefined cutoff time, data in the staging file is
transformed and loaded to the warehouse
There is usually no end user access to the staging file
An operational data store may be used for data staging
Data Transformation
Transforms the data in accordance with the business
rules and standards that have been established
Examples include format changes, deduplication,
splitting up fields, replacement of codes, derived
values, and aggregates
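The listed transformation types can be shown together in one short Python sketch. The sample rows, the `GENDER_CODES` table, and the choice of `id` as the deduplication key are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

GENDER_CODES = {"M": "Male", "F": "Female"}  # hypothetical code table

raw = [
    {"id": 1, "name": "Ann Lee", "gender": "F", "date": "03/15/2024", "qty": 2, "price": 5.0},
    {"id": 1, "name": "Ann Lee", "gender": "F", "date": "03/15/2024", "qty": 2, "price": 5.0},
    {"id": 2, "name": "Bob Ray", "gender": "M", "date": "03/16/2024", "qty": 1, "price": 8.0},
]

def transform(rows):
    seen, out = set(), []
    for r in rows:
        if r["id"] in seen:  # deduplication on the id key
            continue
        seen.add(r["id"])
        first, last = r["name"].split(" ", 1)  # splitting up a field
        out.append({
            "id": r["id"],
            "first_name": first,
            "last_name": last,
            # replacement of codes
            "gender": GENDER_CODES[r["gender"]],
            # format change: MM/DD/YYYY to ISO 8601
            "date": datetime.strptime(r["date"], "%m/%d/%Y").date().isoformat(),
            # derived value
            "total": r["qty"] * r["price"],
        })
    return out

clean = transform(raw)

# Aggregate: total sales per gender.
total_by_gender = defaultdict(float)
for row in clean:
    total_by_gender[row["gender"]] += row["total"]
```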
Data Loading
Data are physically moved to the data warehouse
The loading takes place within a “load window”
The trend is toward near-real-time updates of the data
warehouse, as the warehouse is increasingly used for
operational applications
Metadata
Data about data
Needed by both information technology personnel
and users
IT personnel need to know data sources and
targets; database, table and column names; refresh
schedules; data usage measures; etc.
Users need to know entity/attribute definitions;
reports/query tools available; report distribution
information; help desk contact information, etc.