
Data Warehouse Fundamentals and ETL
Development
By: Tushar Kant Gupta
OVERVIEW ON DATA WAREHOUSE
Why Data Warehousing?
 Which are our lowest/highest margin
customers?
 Who are my customers and what
products are they buying?
 What is the most effective
distribution channel?
 What product promotions have the
biggest impact on revenue?
 Which customers are most likely to
go to the competition?
 What impact will new products/services
have on revenue and margins?
Why Data Warehousing?
Intelligence Service
 Helping management in the
decision-making process
• Used to manage and control business
• Used by managers and end-users to
understand the business and make
judgments
 Identifying critical factors in
achieving the organizational
goals or mission
 Knows the tactical and strategic
aspects of the business
Benefits of Data Warehousing
Intelligence services help an organization to
 Understand business trends and make better
forecasting decisions
 Bring better products to market in a more
timely manner
 Analyze daily sales information and make
quick decisions that can significantly affect
your company's performance
 Data warehousing can be a key differentiator
in many different industries. At present,
some of the most popular Data warehouse
applications include:
 sales and marketing analysis across all
industries
 inventory turn and product tracking in
manufacturing
 category management, vendor analysis, and
marketing program effectiveness analysis in
retail
Impacts of Data Warehousing
 Potential high returns on investment
 Competitive advantage
 Increased productivity of corporate
decision-makers
What is a Data Warehouse?
 A data warehouse is
• a relational/multidimensional database,
designed for query and analysis rather
than transaction processing
• contains historical data that is derived
from transaction data
• separates analysis workload from
transaction workload and enables a
business to consolidate data from
several sources
Transaction System vs. Data Warehouse

                       Operational                        Decision support
Data Content           Current values                     Archival, summarized, calculated data
Data Organization      Application by application         Subject areas across enterprise
Nature of Data         Dynamic                            Static until refreshed
Data Structure/Format  Complex; suitable for              Simple; suitable for business
                       operational computation            analysis
Access Probability     High                               Moderate to low
Data Update            Updated on a field-by-field basis  Accessed and manipulated; no
                                                          direct update
Usage                  Highly structured repetitive       Highly unstructured analytical
                       processing                         processing
Response Time          Sub-second to 2-3 seconds          Seconds to minutes
Conclusion
 The data warehouse is described as the
"single point of truth", the "corporate
memory", the sole historical register of
virtually all transactions that occur in
the life of an organization
 Informational data is extracted from
operational or transactional data for
business intelligence
OBJECTIVE
Objective
Extraction of Knowledge from Operational Data
Transforming operational data
into informational data
 The creation of new fields that
are derived from existing
operational data
 Summarizing data to the most
appropriate level needed for
analysis
 Denormalizing the data for
performance purposes
 Cleansing of the data to ensure
that integrity is preserved.
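The four transformation steps above can be sketched in Python. This is a minimal illustration only; the record layout, field names, and values are invented for the example, not taken from any real operational system:

```python
from collections import defaultdict

# Hypothetical operational order records (field names are illustrative).
orders = [
    {"order_id": 1, "region": "East", "qty": 10, "unit_price": 2.5},
    {"order_id": 2, "region": "East", "qty": 4,  "unit_price": 3.0},
    {"order_id": 3, "region": "West", "qty": 7,  "unit_price": 2.5},
    {"order_id": 4, "region": "West", "qty": None, "unit_price": 2.5},  # dirty row
]

# 1. Cleansing: drop rows whose integrity is broken (missing quantity).
clean = [o for o in orders if o["qty"] is not None]

# 2. Derived field: revenue does not exist in the operational data.
for o in clean:
    o["revenue"] = o["qty"] * o["unit_price"]

# 3. Summarizing (and implicitly denormalizing) to the level needed
#    for analysis: total revenue per region.
revenue_by_region = defaultdict(float)
for o in clean:
    revenue_by_region[o["region"]] += o["revenue"]

print(dict(revenue_by_region))  # {'East': 37.0, 'West': 17.5}
```

In a real warehouse these steps run inside the ETL layer against source extracts, not in-memory lists, but the shape of the work is the same.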
DATA WAREHOUSE DESIGNING
Data Sources: Heterogeneous Information Sources

"Heterogeneities are everywhere": personal
databases, scientific databases, the World
Wide Web, digital libraries
♦ Different interfaces
♦ Different data representations
♦ Duplicate and inconsistent information
Goal to Access Information: Unified Access to Data

An integration system sits over the sources
(personal databases, scientific databases,
the World Wide Web, digital libraries)
• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing
Fundamentals for designing Data
Warehouse
 Data Warehouse and Operational
Environments are Separated
 Data is integrated
 Contains historical data over a
long period of time
 Data is a snapshot captured
at a given point in time
 Data is subject-oriented
 Mainly read-only with periodic
batch updates
Fundamentals for designing Data
Warehouse
 Data contains several levels of detail
• Current, Old, Lightly Summarized, Highly
Summarized
 Environment is characterized by Read-
only transactions to very large data
sets
 System that traces data sources,
transformations, and storage
 Metadata is a critical component
• Source, transformation, integration,
storage, relationships, history, etc
 Contains a chargeback mechanism for
resource usage that enforces optimal
use of data by end users
Data Warehouse – Design Considerations
 Data Warehouses
• can NEVER be STATIC
• have to evolve in tune with changing
business needs
• can never be fully specified up front:
the main content may be known, but not
all the details that will be required
• must change constantly with the
business environment
Data Warehouses therefore have to be
designed with a certain amount of
flexibility.
Data Warehouse Workflow:
Data Warehouse Architectures
 Generic Two-Level Architecture
 Independent Data Mart
 Dependent Data Mart and
Operational Data Store
 Logical Data Mart and @ctive
Warehouse

All involve some form of extraction,
transformation and loading (ETL)
Generic two-level architecture

ETL feeds one, company-wide warehouse.
Periodic extraction  data is not
completely current in the warehouse

Independent Data Mart

Data marts: mini-warehouses, limited in scope.
Separate ETL for each independent data mart
 data access complexity due to
multiple data marts
Dependent data mart with operational data store

The ODS provides an option for obtaining
current data. A single ETL feeds the
enterprise data warehouse (EDW); dependent
data marts are loaded from the EDW
 simpler data access
Logical data mart and @ctive data warehouse

The ODS and the data warehouse are one and
the same. Data marts are NOT separate
databases, but logical views of the data
warehouse  easier to create new data
marts. Near real-time ETL for the @ctive
Data Warehouse
DATA MODELING:
Data Modeling
 Multidimensional Data Schema
Support
• Decision Support Data tends to be
• Nonnormalized
• Duplicated
• Preaggregated
 Accordingly, the commonly used
schemas are
• Star Schema (Most common)
• Special Design technique for
multidimensional data representations
• Optimize data query operations instead of
data update operations
• Snowflake Schema
• Normalized form of star schema
Schema Components
 Facts
• Numeric measurements (values) that
represent a specific business aspect or
activity
• Stored in a fact table at the center of the
star schema
• Contains facts that are linked through
their dimensions
• Can be computed or derived at run time
• Updated periodically with data from
operational databases
 Dimensions
• Qualifying characteristics that provide
additional perspectives to a given fact
Schema Components
 Attributes
• Dimension Tables contain Attributes
• Attributes are used to search, filter, or classify
facts
• Dimensions provide descriptive characteristics
about the facts through their attributes
• Must define common business attributes that will
be used to narrow a search, group information, or
describe dimensions. (ex.: Time / Location /
Product)
• No mathematical limit to the number of dimensions
(3-D makes it easy to model)
 Attribute Hierarchies
• Provides a Top-Down data organization
• Aggregation
• Drill-down / Roll-Up data analysis
• Attributes from different dimensions can be
combined within a single analysis
Star Schema

A central fact table surrounded by
dimension tables
Star Schema Representation
 Fact and Dimensions are represented
by physical tables in the data
warehouse database
 Fact tables are related to each
dimension table in a Many to One
relationship (Primary/Foreign Key
Relationships)
 Fact Table is related to many
dimension tables
• The primary key of the fact table is a
composite primary key from the dimension
tables
 Each fact table is designed to answer
a specific DSS question
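The representation above can be sketched concretely with Python's built-in sqlite3 module. The table and column names here (dim_time, dim_product, fact_sales, and their fields) are hypothetical, chosen only to illustrate the composite-key fact table and a roll-up query:

```python
import sqlite3

# Build a tiny star schema in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, year INT, month INT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units_sold INTEGER,               -- facts: numeric measurements
    revenue    REAL,
    PRIMARY KEY (time_id, product_id) -- composite key from the dimensions
);
""")
con.execute("INSERT INTO dim_time VALUES (1, 2024, 1), (2, 2024, 2)")
con.execute("INSERT INTO dim_product VALUES (10, 'Widget')")
con.execute("INSERT INTO fact_sales VALUES (1, 10, 5, 50.0), (2, 10, 3, 30.0)")

# A typical DSS question: roll monthly revenue up to the year level
# by joining the fact table to a dimension in a many-to-one relationship.
row = con.execute("""
    SELECT t.year, SUM(f.revenue)
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.year
""").fetchone()
print(row)  # (2024, 80.0)
```

The GROUP BY over a dimension attribute is exactly the roll-up operation described under attribute hierarchies; drill-down is the same query grouped by a finer attribute (e.g. month).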
Strengths of the Dimensional Model
 the dimensional model is a predictable,
standard framework.
 withstands unexpected changes in user
behavior
 extensible to accommodate unexpected new
data elements and new design decisions
 no query tool or reporting tool needs to be
reprogrammed to accommodate the change
 there is a body of standard approaches for
handling common modeling situations in the
business enterprise
 availability of a huge body of administrative
utilities and software processes that manage
and use aggregates
ETL Process:
Data Reconciliation
 Typical operational data is:
• Transient – not historical
• Not normalized (perhaps due to
denormalization for performance)
• Restricted in scope – not comprehensive
• Sometimes poor quality – inconsistencies and
errors
 After ETL, data should be:
• Detailed – not summarized yet
• Historical – periodic
• Normalized – 3rd normal form or higher
• Comprehensive – enterprise-wide perspective
• Quality controlled – accurate with full
integrity
The ETL Process
 Capture
 Scrub or data cleansing
 Transform
 Load and Index
 Data flow

ETL = Extract, transform, and load
Steps in data reconciliation

Capture = extract…obtaining a snapshot
of a chosen subset of the source data for
loading into the data warehouse

Static extract = capturing a snapshot of
the source data at a point in time
Incremental extract = capturing changes
that have occurred since the last static
extract
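The two extract modes can be sketched minimally in Python. The `updated_at` field and timestamp values are illustrative assumptions, standing in for whatever change-tracking mechanism a real source system provides:

```python
# Hypothetical source rows, each carrying a last-modified timestamp.
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-05"},
    {"id": 3, "updated_at": "2024-01-09"},
]

def static_extract(rows):
    """Snapshot of the whole chosen subset at a point in time."""
    return list(rows)

def incremental_extract(rows, last_extract_time):
    """Only the changes that occurred since the last extract."""
    return [r for r in rows if r["updated_at"] > last_extract_time]

snapshot = static_extract(source)                   # all three rows
delta = incremental_extract(source, "2024-01-05")   # only id 3
print(len(snapshot), [r["id"] for r in delta])      # 3 [3]
```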
Steps in data reconciliation (continued)

Scrub = cleanse…uses pattern
recognition and AI techniques to
upgrade data quality

Fixing errors: misspellings, erroneous
dates, incorrect field usage, mismatched
addresses, missing data, duplicate data,
inconsistencies
Also: decoding, reformatting, time
stamping, conversion, key generation,
merging, error detection/logging,
locating missing data
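A toy scrubbing routine illustrating a few of these fixes in Python. The lookup table, validation rule, and sample rows are all invented for the example; real cleansing uses far richer reference data and matching logic:

```python
import re

# Assumed correction lookup for known misspellings (illustrative).
KNOWN_FIXES = {"Nwe York": "New York", "Chicgao": "Chicago"}

def scrub(records):
    """Fix misspellings, reject missing/erroneous data, drop duplicates,
    and log errors instead of silently losing them."""
    cleaned, errors, seen = [], [], set()
    for r in records:
        city = KNOWN_FIXES.get(r.get("city"), r.get("city"))
        if not city:
            errors.append(("missing city", r))       # missing data
            continue
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", r.get("date", "")):
            errors.append(("bad date", r))           # erroneous dates
            continue
        key = (r["id"], city)
        if key in seen:                              # duplicate data
            continue
        seen.add(key)
        cleaned.append({**r, "city": city})
    return cleaned, errors

rows = [
    {"id": 1, "city": "Nwe York", "date": "2024-01-02"},
    {"id": 1, "city": "New York", "date": "2024-01-02"},  # duplicate
    {"id": 2, "city": "",         "date": "2024-01-03"},  # missing
    {"id": 3, "city": "Chicago",  "date": "03/01/2024"},  # bad date
]
cleaned, errors = scrub(rows)
print(len(cleaned), len(errors))  # 1 2
```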
Steps in data reconciliation (continued)

Transform = convert data from format
of operational system to format of data
warehouse

Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization
Field-level:
single-field – from one field to one field
multi-field – from many fields to one, or
one field to many
Steps in data reconciliation (continued)

Load/Index = place transformed data
into the warehouse and create indexes

Refresh mode: bulk rewriting of target
data at periodic intervals
Update mode: only changes in source
data are written to data warehouse
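A minimal sketch of the two load modes, using a plain Python dict keyed by row id to stand in for the target table:

```python
def load_refresh(target, source_rows):
    """Refresh mode: bulk rewrite of the target at periodic intervals."""
    target.clear()
    target.update({r["id"]: r for r in source_rows})

def load_update(target, changed_rows):
    """Update mode: only changed rows are written to the target."""
    for r in changed_rows:
        target[r["id"]] = r

warehouse = {}
load_refresh(warehouse, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
load_update(warehouse, [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}])
print(sorted(warehouse))  # [1, 2, 3]
```

Refresh is simpler but rewrites everything; update touches only deltas, which is why it pairs naturally with incremental extraction.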
Data flows
 Inflow – The processes associated with the
extraction, cleansing, and loading of the
data from the source systems into the data
warehouse
 Upflow – The processes associated with
adding value to the data in the warehouse
through summarizing, packaging, and
distribution of the data
 Downflow – The processes associated with
archiving and backing-up of data in the
warehouse
 Outflow – The processes associated with
making the data available to the end-users
 Meta-flow – The processes associated with
the management of the meta-data
Information flows of a data warehouse
(figure): operational data sources feed the
load manager (inflow); the warehouse manager
maintains meta-data, detailed data, lightly
summarized data, and highly summarized data
in the DBMS, alongside an operational data
store (ODS) (upflow, meta-flow); the query
manager serves end-user access tools –
reporting, query, application development,
and EIS (executive information system)
tools, OLAP (online analytical processing)
tools, and data mining tools (outflow);
archive/backup data is handled by downflow.
Problems

 High demand for resources
 Increased end-user demands
 High maintenance
 Extracting, cleansing and loading data could
be time consuming.
 Data warehousing increases project scope.
 Problems with compatibility with systems
already in place e.g. transaction processing
system.
 Providing training to end-users, who end up
not using the data warehouse.
 Security could develop into a serious issue,
especially if the data warehouse is web
accessible.