
Data Warehouse Fundamentals and ETL

Tushar Kant
Why Data Warehousing?
 Which are our lowest/highest margin customers?
 Who are my customers and what products are they buying?
 Which is the most effective distribution channel?
 What product promotions have the biggest impact on revenue?
 Which customers are most likely to go to the competition?
 What impact will new products/services have on revenue and margins?
Why Data Warehousing?
Intelligence Service
 Helping management in the decision-making process
• Used to manage and control the business
• Used by managers and end-users to understand the business and make decisions
 Identifying critical factors in achieving the organizational goals or mission
 Knowing the tactical and strategic aspects of the business
Benefits of Data Warehousing
Intelligence services help an organization to
 Understand business trends and make better forecasting decisions
 Bring better products to market in a more timely manner
 Analyze daily sales information and make quick decisions that can significantly affect the company's performance
 Data warehousing can be a key differentiator in many different industries. At present, some of the most popular data warehouse applications include:
 sales and marketing analysis across all industries
 inventory turn and product tracking in manufacturing
 category management, vendor analysis, and marketing program effectiveness analysis in retail
Impacts of Data Warehousing
 Potential high returns on investment
 Competitive advantage
 Increased productivity of corporate decision-makers
What is a Data Warehouse?
 A data warehouse
• is a relational/multidimensional database designed for query and analysis rather than transaction processing
• contains historical data that is derived from transaction data
• separates the analysis workload from the transaction workload and enables a business to consolidate data from several sources
Transaction System vs. Data Warehouse
Aspect                   Operational                         Decision Support
Data Content             Current values                      Archival, summarized, calculated data
Data Organization        Application by application          Subject areas across the enterprise
Nature of Data           Dynamic                             Static until refreshed
Data Structure & Format  Complex; suitable for               Simple; suitable for business analysis
                         operational computation
Access Probability       High                                Moderate to low
Data Update              Updated on a field-by-field basis   Accessed and manipulated; no direct update
Usage                    Highly structured repetitive        Highly unstructured analytical
                         processing                          processing
Response Time            Sub-second to 2-3 seconds           Seconds to minutes
 Described as the "single point of truth", the "corporate memory", the sole historical register of virtually all transactions that occur in the life of an organization
 Extraction of informational data from operational or transactional data to support business intelligence
Extraction of Knowledge from Operational Data
Transforming operational data into informational data
 The creation of new fields that are derived from existing operational data
 Summarizing data to the most appropriate level needed for analysis
 Denormalizing the data for performance purposes
 Cleansing the data to ensure that integrity is preserved
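As a minimal sketch of the first two steps (the table and field names here are invented for illustration), the following Python fragment derives a new margin field from operational order rows and then summarizes it to the product level:

```python
from collections import defaultdict

# Hypothetical operational order rows (names and values invented).
orders = [
    {"product": "widget", "revenue": 120.0, "cost": 80.0},
    {"product": "widget", "revenue": 150.0, "cost": 90.0},
    {"product": "gadget", "revenue": 200.0, "cost": 160.0},
]

# 1) Create a new field (margin) derived from existing operational data.
for row in orders:
    row["margin"] = row["revenue"] - row["cost"]

# 2) Summarize to the most appropriate level (here: per product).
summary = defaultdict(float)
for row in orders:
    summary[row["product"]] += row["margin"]

print(dict(summary))  # {'widget': 100.0, 'gadget': 40.0}
```

Denormalization and cleansing would follow the same pattern: reshape and repair the rows before they are loaded, not at query time.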
Data Sources: Heterogeneous Information Sources

"Heterogeneities are everywhere": personal databases, scientific databases, digital libraries, the World Wide Web
♦ Different interfaces
♦ Different data representations
♦ Duplicate and inconsistent information
Goal to Access Information: Unified Access to Data

Integration system over digital libraries, scientific databases, and personal databases:
• Collects and combines information
• Provides an integrated view and a uniform user interface
• Supports sharing
Fundamentals for Designing a Data Warehouse
 Data Warehouse and Operational
Environments are Separated
 Data is integrated
 Contains historical data over a
long period of time
 Data is snapshot data, captured at a given point in time
 Data is subject-oriented
 Mainly read-only with periodic
batch updates
Fundamentals for Designing a Data Warehouse
 Data contains several levels of detail
• Current, old, lightly summarized, highly summarized
 Environment is characterized by read-only transactions against very large data volumes
 System traces data sources, transformations, and storage
 Metadata is a critical component
• Source, transformation, integration, storage, relationships, history, etc.
 Contains a chargeback mechanism for resource usage that enforces optimal use of data by end users
Data Warehouse – Design Considerations
 Data warehouses
• have to evolve in tune with changing business needs
• can never be fully specified up front: the main content may be known, but not all the details that will be required
• must be designed so that they can change constantly with the business environment
Data warehouses will therefore necessarily have to be designed with a certain amount of flexibility.
Data Warehouse Architectures
 Generic Two-Level Architecture
 Independent Data Mart
 Dependent Data Mart and
Operational Data Store
 Logical Data Mart and @ctive Data Warehouse

All involve some form of extraction, transformation and loading (ETL)

Generic two-level architecture

• One company-wide warehouse, fed from the operational source systems through ETL
• Periodic extraction, so data is not completely current in the warehouse

Independent data mart
• Data marts: mini-warehouses, limited in scope
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts
Dependent data mart with operational data store
• ODS provides an option for obtaining current data
• Single ETL for the enterprise data warehouse (EDW)
• Simpler data access; dependent data marts are loaded from the EDW
Logical data mart and @ctive data warehouse
• ODS and data warehouse are one and the same
• Near-real-time ETL for the @ctive data warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse, so it is easier to create new data marts
Data Modeling
 Multidimensional data schema
• Decision support data tends to be
• nonnormalized
• duplicated
• preaggregated
 Commonly used schemas based on these properties:
• Star schema (most common)
• A special design technique for multidimensional data representations
• Optimizes data query operations instead of data update operations
• Snowflake schema
• A normalized form of the star schema
Schema Components
 Facts
• Numeric measurements (values) that represent a specific business aspect or activity
• Stored in a fact table at the center of the star schema
• Linked to their dimensions through foreign keys
• Can be computed or derived at run time
• Updated periodically with data from operational databases
 Dimensions
• Qualifying characteristics that provide additional perspectives to a given fact
Schema Components
 Attributes
• Dimension tables contain attributes
• Attributes are used to search, filter, or classify facts
• Dimensions provide descriptive characteristics about the facts through their attributes
• Must define common business attributes that will be used to narrow a search, group information, or describe dimensions (e.g., Time / Location / Product)
• No mathematical limit to the number of dimensions (3-D makes it easy to model)
 Attribute hierarchies
• Provide a top-down data organization
• Aggregation
• Drill-down / roll-up data analysis
• Attributes from different dimensions can be combined within a single query
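The roll-up/drill-down idea can be sketched as grouping the same facts at different levels of a time hierarchy (the sample data and function below are invented for illustration):

```python
from collections import defaultdict

# Facts tagged with a time hierarchy: year -> quarter (invented data).
sales = [
    {"year": 2023, "quarter": "Q1", "amount": 100},
    {"year": 2023, "quarter": "Q2", "amount": 150},
    {"year": 2024, "quarter": "Q1", "amount": 200},
]

def aggregate(rows, keys):
    """Sum 'amount' grouped by the given hierarchy levels."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[k] for k in keys)] += r["amount"]
    return dict(totals)

# Roll-up: aggregate at a coarser level (year only).
by_year = aggregate(sales, ["year"])             # {(2023,): 250, (2024,): 200}

# Drill-down: re-aggregate at a finer level (year and quarter).
by_quarter = aggregate(sales, ["year", "quarter"])
```

The same `aggregate` call works at any level of the hierarchy, which is exactly what makes attribute hierarchies useful for top-down analysis.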
Star Schema

(Diagram: a central fact table surrounded by its dimension tables)
Star Schema Representation
 Facts and dimensions are represented by physical tables in the data warehouse database
 Fact tables are related to each dimension table in a many-to-one relationship (primary/foreign key relationships)
 A fact table is related to many dimension tables
• The primary key of the fact table is a composite key made up of the foreign keys to the dimension tables
 Each fact table is designed to answer a specific DSS question
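A minimal illustration of these relationships (all table contents invented): each fact row carries one foreign key per dimension, and the composite of those keys identifies the fact.

```python
# Dimension tables, keyed by their primary keys (invented sample data).
time_dim = {1: {"year": 2023, "quarter": "Q1"}}
product_dim = {10: {"name": "widget", "category": "tools"}}

# Fact table: composite primary key = (time_key, product_key),
# each component a foreign key into one dimension (many-to-one).
sales_fact = [
    {"time_key": 1, "product_key": 10, "units": 5, "amount": 50.0},
]

# Resolving a fact row through its dimensions (a star join in miniature).
fact = sales_fact[0]
report_row = {
    "quarter": time_dim[fact["time_key"]]["quarter"],
    "product": product_dim[fact["product_key"]]["name"],
    "amount": fact["amount"],
}
print(report_row)  # {'quarter': 'Q1', 'product': 'widget', 'amount': 50.0}
```

In a real warehouse the lookups above become SQL joins between the fact table and its dimension tables.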
Strengths of the Dimensional Model
 The dimensional model is a predictable, standard framework
 Gracefully withstands unexpected changes in user behavior
 Extensible to accommodate unexpected new data elements and new design decisions
 No query tool or reporting tool needs to be reprogrammed to accommodate such changes
 There is a body of standard approaches for handling common modeling situations in the business enterprise
 Availability of a large body of administrative utilities and software processes that manage and use aggregates
ETL Process:
Data Reconciliation
 Typical operational data is:
• Transient, not historical
• Not normalized (perhaps due to denormalization for performance)
• Restricted in scope, not comprehensive
• Sometimes of poor quality, with inconsistencies and errors
 After ETL, data should be:
• Detailed, not summarized yet
• Historical, periodic
• Normalized (3rd normal form or higher)
• Comprehensive, with an enterprise-wide perspective
• Quality controlled: accurate, with full integrity
The ETL Process
 Capture
 Scrub or data cleansing
 Transform
 Load and Index
 Data flow

ETL = Extract, transform, and load

Steps in data reconciliation

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

• Static extract: capturing a snapshot of the source data at a point in time
• Incremental extract: capturing changes that have occurred since the last static extract
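The two extract styles can be sketched as follows (the source schema and timestamps are invented; real systems would read from a change log or a last-modified column):

```python
from datetime import datetime

# Source rows with a last-modified timestamp (schema invented for illustration).
source = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 3, 1)},
]

def static_extract(rows):
    """Static extract: a full snapshot of the source at a point in time."""
    return list(rows)

def incremental_extract(rows, since):
    """Incremental extract: only rows changed since the last extract."""
    return [r for r in rows if r["modified"] > since]

last_run = datetime(2024, 2, 1)
snapshot = static_extract(source)              # both rows
delta = incremental_extract(source, last_run)  # only the row modified after last_run
```

Incremental extraction keeps warehouse loads small, at the cost of having to track what changed since the previous run.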
Steps in data reconciliation (continued)

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

• Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
• Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
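A toy cleansing pass might look like this (the records, the casing rule, and the assumption that day-first dates are written DD-MM-YYYY are all invented for illustration):

```python
import re

# Raw records with typical quality problems (invented data).
raw = [
    {"name": "Alice ", "date": "2024/01/05"},
    {"name": "alice", "date": "2024-01-05"},   # duplicate, different casing
    {"name": "Bob", "date": "05-01-2024"},     # inconsistent date layout
]

def scrub(records):
    """Normalize fields and drop duplicates (a toy cleansing pass)."""
    seen, cleaned = set(), []
    for r in records:
        name = r["name"].strip().title()
        # Reformat dates to ISO; assume DD-MM-YYYY when the year comes last.
        m = re.fullmatch(r"(\d{2})-(\d{2})-(\d{4})", r["date"])
        if m:
            date = f"{m.group(3)}-{m.group(2)}-{m.group(1)}"
        else:
            date = r["date"].replace("/", "-")
        key = (name, date)
        if key not in seen:           # duplicate detection after normalization
            seen.add(key)
            cleaned.append({"name": name, "date": date})
    return cleaned

clean = scrub(raw)   # two records survive; all dates are ISO-formatted
```

Production cleansing adds error logging and quarantine of unfixable rows rather than silently dropping them.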
Steps in data reconciliation (continued)

Transform = convert data from the format of the operational system to the format of the data warehouse

• Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization)
• Field-level: single-field (from one field to one field), multi-field (from many fields to one, or one field to many)
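The record-level and field-level transformations can be sketched together (rows and field names invented for illustration):

```python
from collections import defaultdict

# Operational rows (invented data).
rows = [
    {"store": "north", "product": "widget", "qty": 2, "first": "Ada", "last": "Lovelace"},
    {"store": "north", "product": "widget", "qty": 3, "first": "Alan", "last": "Turing"},
    {"store": "south", "product": "gadget", "qty": 1, "first": "Grace", "last": "Hopper"},
]

# Record-level selection: keep only one partition of the data.
north = [r for r in rows if r["store"] == "north"]

# Record-level aggregation: summarize qty per product.
totals = defaultdict(int)
for r in rows:
    totals[r["product"]] += r["qty"]

# Field-level multi-field transform: many fields -> one field.
for r in rows:
    r["customer"] = f'{r["first"]} {r["last"]}'
```

Joining (record-level data combining) follows the same shape: matching rows from two sources on a shared key before loading.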
Steps in data reconciliation (continued)

Load/Index = place transformed data into the warehouse and create indexes

• Refresh mode: bulk rewriting of target data at periodic intervals
• Update mode: only changes in source data are written to the data warehouse
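The difference between the two load modes, sketched with a dictionary standing in for the target table (all names invented for illustration):

```python
def refresh(target, snapshot):
    """Refresh mode: bulk-rewrite the target from a full snapshot."""
    target.clear()
    target.update(snapshot)

def update(target, changes):
    """Update mode: write only the changed rows into the target."""
    target.update(changes)

w1 = {"a": 1, "b": 2}
update(w1, {"b": 20})      # only 'b' is rewritten; 'a' is untouched

w2 = {"a": 1, "b": 2}
refresh(w2, {"c": 3})      # the whole target is replaced by the snapshot
```

Refresh is simpler but expensive for large targets; update mode needs change capture but writes far less data per load.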
Data flows
 Inflow: the processes associated with the extraction, cleansing, and loading of data from the source systems into the warehouse
 Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data
 Downflow: the processes associated with archiving and backing up data in the warehouse
 Outflow: the processes associated with making the data available to the end-users
 Meta-flow: the processes associated with the management of the metadata
Information flows of a data warehouse (diagram): operational data sources feed a load manager (inflow); the warehouse manager maintains metadata, detailed data, and lightly/highly summarized data in the DBMS (upflow), plus archive/backup (downflow); a query manager serves end-user access tools (outflow), including reporting, query and application development tools, EIS (executive information system) tools, OLAP (online analytical processing) tools, and data mining tools.
Problems

 High demand for resources
 Increased end-user demands
 High maintenance
 Extracting, cleansing, and loading data can be time consuming
 Data warehousing increases the project scope
 Problems with compatibility with systems already in place, e.g. transaction processing systems
 Providing training to end-users, who may end up not using the data warehouse
 Security could develop into a serious issue, especially if the data warehouse is web accessible