
FOR STUDENT REFERENCE ONLY

TRAINER:- CHRISTOPHER RICHARD


1
Data Warehousing
For the Participants of IBM Bangalore
Prepared By
Christopher Richard
Data Warehousing System Architect
[Microsoft Certified Trainer]
2
OBJECTIVES
This Training is for you, the Designers, managers, and
owners of the data warehouse.
This Training is a field guide, a set of tools, for designing,
developing, and deploying data warehouses.
Concrete and actionable
The training describes a coherent framework that goes all
the way from the original scoping of an overall data
warehouse, through all the detailed steps of developing
and deploying the data warehouse.
Along the way, I hope to give you the perspective and
judgment I have accumulated in doing several data
warehouse installations and consultation assignments
since 1996.
3
OBJECTIVES
Achieve your goals of building a data warehouse
more quickly
Build effective data warehouses that match well
against the goals.
And Make fewer mistakes along the way
You will not reinvent the wheel and discover
previously owned truths.
Structure and discipline to help in building a large
and complex data warehouse.
4
Evolution of Data Warehousing
How Did We Get Here?
5
The progression
1st data warehouse in 1905 by Dupont Corp
1st data cube by sales, branch and date
1970s - Management Decision Systems developed a
product called Express (later Oracle Express)
1983 - Metaphor - founded by Ralph Kimball and 2
partners as a standalone DSS
Lessons learned - manage information as a corporate resource
1980 - E.F. Codd - the promise of relational databases (data
every which way)
1993 - Inmon - popularisation of the term "data warehouse"
6
Evolution through 90s
Reporting
Summarization
EIS applications
OLAP
Data Mining
Intelligent Agents
Active Warehouses
7
Data Warehousing Industry
8
Data Warehousing Industry
9
Introduction
The data warehouse marketplace has moved
beyond its infancy
A data warehouse is continuously evolving and
dynamic.
A data warehouse cannot be static.
Complete Lifecycle perspective.
At the very least, a data warehouse needs to evolve
as fast as the surrounding organization evolves.
Adjust our expectations and our techniques from
the original idealistic, static view
10
Introduction
We need design techniques that are flexible and
adaptable.
We need to be half DBA and half MBA.
We need our changes to the data warehouse to
always be graceful.
There are a number of security topics you simply
have to understand if you are going to perform
your job responsibly.
Welcome to Data Warehousing!!!!
11
MESSAGE:
Information Requirements are Increasing -
Geometrically
A Goodly Chunk of Them will have to be Met,
so Build a Data Warehouse
BUT, BEFORE YOU BUILD A DATA WAREHOUSE
INFORM YOURSELF - If You Don't,
The DW Consultants will Steal You Blind
12
TO INFORM YOURSELF:
READ: The Data Warehouse Toolkit
READ: The Data Warehouse Lifecycle Toolkit
JOIN: This Data Warehouse Training Program
ATTEND: One Implementation Conference
WATCH: Every Presentation on Data Warehousing you can
SUBSCRIBE: To these Listservs
DW-List: http://www.datawarehousing.com/list.asp
EduCause: http://www.educause.edu/memdir/cg/cg.html
13
The Goals of a Data Warehouse
The most important assets of an organization are almost always
kept in two forms:
The Operational systems of record
The Data Warehouse
Ultimately, we need to put aside the details of implementation
and modeling, and remember what the fundamental goals of
the data warehouse are.
Makes an organization's information accessible
Makes the organization's information consistent
Is an adaptive and resilient source of information
Is a secure bastion that protects our information asset
Is the foundation for decision making
Is accepted and used by the end user
14
The Chess Pieces
Source System-
An operational system of record whose function it is to
capture the transactions of the business
Main properties of a source system are uptime and
availability.
Data Staging Area-
A Storage area and set of processes that clean,
transform, combine, de-duplicate, household, archive
and prepare source data for use in the data warehouse.
No User Query services
15
The Chess Pieces
Presentation Server -
The target physical machine on which the data
warehouse data is organized and stored for direct
querying by end users, report writers, and other
applications.
Dimensional Model
A specific discipline for modeling data that is an
alternative to entity relationship (E/R) modeling.
Business Process
A coherent set of business activities that make sense to
the business users of our data warehouses
16
The Chess Pieces
ROLAP ( Relational OLAP )
A storage option or set of user interfaces and
applications that give a relational database a
dimensional flavor.
MOLAP ( Multidimensional OLAP)
A storage option or set of user interfaces and
applications and proprietary database technology that
have a strongly dimensional flavor.
HOLAP ( Hybrid OLAP)
A storage option of both relational and proprietary
structure.
17
The Chess Pieces
Data Mart
A logical subset of the complete data warehouse.
Data Warehouse -
The queryable source of data in the enterprise.
OLAP (On-line Analytic Processing)
The general activity of querying and presenting text
and number data from data warehouses, as well as a
specifically dimensional style of querying and
presenting that is exemplified by a number of OLAP
vendors
18
The Chess Pieces
End User Application
A collection of tools that query, analyze, and present
information targeted to support a business need.
End User Data Access Tool -
A client of the data warehouse.
Ad Hoc Query Tool
A specific kind of end user data access tool that invites
the user to form their own queries by directly
manipulating relational tables and their joins.
19
View the data
Create reports
Ad-hoc
Fine Tuning
All Done... NOT!!
20
The Chess Pieces
Modeling Applications
A sophisticated kind of data warehouse client with analytic
capabilities that transform or digest the output from the data
warehouse.
Modeling applications include :
Forecasting models
Behavior scoring models
Allocation models
Data mining tools
Metadata
All the information in the data warehouse environment that is
not the actual data itself.
21
DWH Architecture
[Diagram: Information Sources (Operational DBs, External Sources) feed tools for
extraction, cleaning, loading, integration, etc., which populate the Data Warehouse
and the Data Marts; OLAP Servers sit on top and support the Client Tools - OLAP
tools for queries/reports, analysis, and data mining.]
22
Two Different Worlds
OLTP is profoundly different from dimensional data
warehousing.
Design techniques and design instincts appropriate for
transaction processing are inappropriate and even
destructive for data warehousing.
Consistency
OLTP consistency is microscopic
All we care about is that all transactions presented to the system
have been accounted for
Data warehouse has a quality assurance perspective.
We care enormously that the current load of data is a full and
consistent set of data
23
Two Different Worlds
Transaction
An OLTP system processes thousands or even millions of
transactions
DW will process only one transaction per day.
We call it a Production Data Load
Users and Managers
OLTP system users turn the wheels of an organization
OLTP system users almost always deal with one account at a
time.
They perform the same task many, many times.
Performance is the absolute king of the OLTP system
Reporting is the primary activity of the Data warehouse.
24
Two Different Worlds
One Machine or Two
The resource argument is usually sufficient reason to require a
second machine
The data warehouse is often a centralized resource where data is
integrated from multiple remote OLTP systems.
Data must be copied from the OLTP systems and restructured into the DW.
The Time Dimension
OLTP database is a twinkling database
This is the first temporal inconsistency that we avoid in a data
warehouse.
It is a major burden on the OLTP system to correctly depict old
history.
25
Two Different Worlds
The Entity Relational Data Model
The E/R model: the miracle that
drives out redundancy.
The closest analogy is to the map of Los Angeles.
The E/R model is very symmetric.
Huge number of connection paths between tables.
The value of the E/R model is to use the tables individually and
in pairs.
E/R models are a disaster for querying because they cannot be
understood by users,
and cannot be navigated usefully by DBMS software.
The E/R model cannot be used as the basis for an enterprise DW.
26
A small subset of tables
of an existing system
Typical ERDs
27
Northwind Database Model - Relational Format
[ERD: the Northwind OLTP schema]
Categories (PK CategoryID; CategoryName, Description, Picture)
Suppliers (PK SupplierID; CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax, HomePage)
Products (PK ProductID; FK SupplierID, FK CategoryID; ProductName, QuantityPerUnit, UnitPrice, UnitsInStock, UnitsOnOrder, ReorderLevel, Discontinued)
Customers (PK CustomerID; CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax)
CustomerDemographics (PK CustomerTypeID; CustomerDesc)
CustomerCustomerDemo (PK CustomerID + CustomerTypeID)
Employees (PK EmployeeID; FK ReportsTo; LastName, FirstName, Title, TitleOfCourtesy, BirthDate, HireDate, Address, City, Region, PostalCode, Country, HomePhone, Extension, Photo, Notes, PhotoPath)
Region (PK RegionID; RegionDescription)
Territories (PK TerritoryID; FK RegionID; TerritoryDescription)
EmployeeTerritories (FK EmployeeID, FK TerritoryID)
Shippers (PK ShipperID; CompanyName, Phone)
Orders (PK OrderID; FK CustomerID, FK EmployeeID, FK ShipVia; OrderDate, RequiredDate, ShippedDate, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode, ShipCountry)
Order Details (PK OrderID + ProductID; UnitPrice, Quantity, Discount)
28
The Dimensional Model
A simple data cube structure that matches end
users' needs for simplicity.
The dimensional model is very asymmetric.
One large dominant table in the center of the
schema.
It is the only table in the schema with multiple
joins.
The center table is called the Fact Table.
The other tables are called the Dimension
Tables.
29
Components of a Star Schema
30
Star Schema Example
31
Northwind Database Star Schema - Orders
[Star schema: the fctOrders fact table surrounded by its dimensions]
fctOrders (PK OrderKey; FK ProductKey, EmployeeKey, CustomerKey, ShipperKey, OrderDateKey, RequiredDateKey, ShippedDateKey; OrderID, ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode, ShipCountry)
dimCustomers (PK CustomerKey; CustomerID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax, CustomerTypeID, CustomerDesc)
dimShippers (PK ShipperKey; ShipperID, CompanyName, Phone)
dimEmployees (PK EmployeeKey; EmployeeID, LastName, FirstName, Title, TitleOfCourtesy, BirthDate, HireDate, Address, City, Region, PostalCode, Country, HomePhone, Extension, Photo, Notes, ReportsTo, PhotoPath, TerritoryID, TerritoryDescription, RegionID, RegionDescription)
dimOrderDetails (PK ProductKey; OrderID, UnitPrice, Quantity, Discount, ExtendedPrice, ProductID, ProductName, QuantityPerUnit, UnitPrice, UnitsInStock, UnitsOnOrder, ReorderLevel, Discontinued, CategoryID, CategoryName, Description, SupplierID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax, HomePage)
dimDate (PK DateKey; DayDate, DayDate_YYYYMMDD, DayOfWeekName, DayOfWeekNameAbbrv, DayNumberInWeek, DayNumberInMonth, DayNumberInQuarter, DayNumberInYear, WeekDayIndicator, WeekEndIndicator, Week_YYYYWW, WeekNumberInYear, Month_YYYYMM, MonthName, MonthNameAbbrv, MonthNumberInYear, Quarter_YYYYQ, QuarterName, QuarterNameAbrv, QuarterNumberInYear, Year)
32
Dimensions in Data Analysis
In the world of data warehousing, a summarizable numerical
value that you use to monitor your business is called a FACT
When looking for numeric information, your first question will
be: "What fact do you want to see?"
You could look at, let's say, sales units, sales dollars, defects, etc.
Suppose that you ask to see a report of your company's Units
Sold.
Here's what you get:
113
33
Dimensions in Data Analysis
Looking at one value doesn't tell you much. You want to
break it into something more informative. For example,
how has your company done over time?
You ask for a monthly report on Units Sold
Here's the new report:
January | February | March | April
14      | 41       | 33    | 25
34
Dimensions in Data Analysis
You're still not satisfied with the monthly report. Your
company sells more than one product: how did each of
those products do over time?
You ask for a new report on Units Sold by product and
time
Here's the new report:
[Crosstab report: Units Sold by product (Salt Bread, Sweet Bread, Muffins)
and by month (Jan-Apr)]
35
Dimensions in Data Analysis
Suppose your company sells in two different states and you would like to know
how each product is doing each month in each state.
You ask for a new report on Units Sold by product, by time, and by state.
Here's the new report:
[Crosstab report: Units Sold by product (Salt Bread, Sweet Bread, Muffins)
and by month (Jan-Apr), shown separately for each state (KA, TN)]
36
Dimensions in Data Analysis
Whichever way you lay out your report, it has three
independent lists of labels.
The total number of potential values in the report equals
the number of unique items in the first independent list of
labels (2 states) times the number of unique items in the
second independent list of labels (3 products) times the
number of unique items in the third independent list of
labels (4 months): 2 × 3 × 4 = 24 potential values.
In place of "independent list of labels," data warehouse
designers borrow the term dimension from mathematics.
37
Dimensions in Data Analysis
Thus our report has 3 dimensions: TIME, STATE,
and PRODUCTS.
The items in a dimension are called members of
that dimension.
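Mechanically, a report like the ones above is just a pivot over the dimension labels. A minimal Python sketch (assuming pandas is available; the sample rows are invented and are not the report data above):

import pandas as pd

# Each record carries one label per dimension plus the fact (units sold).
sales = pd.DataFrame({
    "state":   ["KA", "KA", "TN", "TN"],
    "product": ["Salt Bread", "Muffins", "Salt Bread", "Sweet Bread"],
    "month":   ["Jan", "Jan", "Feb", "Mar"],
    "units":   [3, 4, 3, 16],
})

# Laying out the report = choosing which dimensions go on rows vs. columns.
report = sales.pivot_table(index=["state", "product"], columns="month",
                           values="units", aggfunc="sum", fill_value=0)
print(report)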
38
Hierarchies in Data Analysis
Grouping and aggregating is the way that humans deal
with numerous items.
Once your company has sold items for over a year, you
would like to look at reports by year, quarter and
month.
But how do aggregations such as quarters fit into a
dimension?
Generally you think of members in a dimension as
belonging together.
39
Hierarchies in Data Analysis
Do months and quarters belong together?
Months and quarters form a hierarchy within the
Time Dimension, and each degree of
summarization is referred to as a level.
The members at the lowest level of detail are called
leaf members.
There are 3 types of hierarchies that you may
encounter
Balanced Hierarchies
Unbalanced Hierarchies
Ragged Hierarchies
40
Balanced Hierarchies
1998
Qtr1: Jan, Feb, Mar
Qtr2: Apr, May, Jun
Qtr3: Jul, Aug, Sep
Qtr4: Oct, Nov, Dec
41
Unbalanced Hierarchies
[Diagram: an employee-reporting hierarchy (Sheri, Darren, Maya, Rebecca,
Walter, Brenda, Jonathan) whose branches end at different depths]
42
Ragged Hierarchies
[Diagram: a geography hierarchy - North America at the top; USA, Canada,
Mexico below; USA drilling through North West to California, Oregon,
Washington; Canada going straight to Brit. Columbia; Mexico through Dist.
Federal to Zacatecas - where some branches skip a level]
43
Fact Table
A Fact Table is a table in the relational data warehouse
that stores the detailed values for measures, or facts.
Example: a fact table that stores Dollars and Units by
state, by product and by month has five columns.
The first 3 columns are key columns; the remaining two
are measure values.
State | Product | Month | Units | Dollars
44
Fact Table
Each column in the fact table should be either a key or a
measure.
The fact table must contain a column for each measure.
The fact table must contain rows at the lowest level of detail you
might want to retrieve for a measure.
A fact table almost always uses an integer key for each member
rather than a descriptive name.
The key column for a date dimension might be either an integer
key or a date.
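To make these rules concrete, here is a minimal sketch of the five-column example as physical tables, written in Python against SQLite (table names and types are illustrative assumptions, not a prescribed design):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_state   (state_key   INTEGER PRIMARY KEY, state_name TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, prod_name TEXT, subcategory TEXT);
CREATE TABLE dim_month   (month_key   INTEGER PRIMARY KEY, month_name TEXT, year INTEGER);

-- Every fact column is either a key or a measure; the three keys define the grain.
CREATE TABLE fact_sales (
    state_key   INTEGER NOT NULL REFERENCES dim_state(state_key),
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    month_key   INTEGER NOT NULL REFERENCES dim_month(month_key),
    units       INTEGER,   -- measure
    dollars     REAL,      -- measure
    PRIMARY KEY (state_key, product_key, month_key)
);
""")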
45
Dimension Tables
A dimension table contains one row for each leaf level
member of the dimension.
Ex. A product dimension table with 3 products will have
3 rows.
In most cases a dimension table also contains a numeric
key column that uniquely identifies each member.
This column containing the unique value is the primary
key, and it is referenced by the foreign key in the fact table.
46
Dimension Tables
If the dimension is involved in a balanced hierarchy it will
have an additional column that gives the parent for each
member.
E.g., if you have 3 products in a dimension table that belong to
particular product subcategories, your table will look like this:
PROD_ID | Prod_Name       | SubCategory
589     | Sweet Muffins   | Muffins
592     | Coconut Muffins | Muffins
1218    | Salt Bread      | Bread
47
Star Schema
When each dimension is stored in a single table, the
database's organization is called a Star Schema design.
When a database's dimensions are stored in a chain of
tables, the design is called a Snowflake
design.
A relational database must perform time-consuming joins
each time a report executes, and a star design for a
dimension requires fewer joins than a snowflake design.
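Reusing the hypothetical SQLite schema sketched after the fact table slide (conn is the connection created there), a star design answers a report query with exactly one join per dimension:

# conn: the SQLite connection from the earlier star-schema sketch.
rows = conn.execute("""
    SELECT s.state_name, p.prod_name, m.month_name,
           SUM(f.units) AS units, SUM(f.dollars) AS dollars
    FROM fact_sales f
    JOIN dim_state   s ON s.state_key   = f.state_key
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_month   m ON m.month_key   = f.month_key
    GROUP BY s.state_name, p.prod_name, m.month_name
""").fetchall()

A snowflaked product dimension would add further joins (product to subcategory to category), which is exactly the extra work the slide warns against.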
48
Star V/s Snowflake Schema
[Diagram: 'Star good - Snowflake BAD!!!!']
Star: a SHIPMENTS fact table (PK PRODKEY + INVOICE; FK PERKEY, CUSTKEY,
SHIPKEY; measures DOLLARS, WEIGHT) joined directly to
CUSTOMER (PK CUSTKEY; NAME, STREET, CITY, STATE, ZIP),
PRODUCT (PK PRODKEY; PRODUCT, DISTRIBUTOR, BERRY, AROMA, ACID, BODY, ROAST),
PERIOD (PK PERKEY; MONTH, YEAR, QUARTER, TRI, DATE_COL), and
SHIPDATE (PK PERKEY; MONTH, YEAR, QUARTER, TRI, DATE_COL).
Snowflake: the product dimension chained out to a further D_PROD table
(PROD_CODE, PROD_NAME, POSITION, TYPE, VERSION).
49
Star Schema with Sample Data
50
Tip
Sometimes when we are designing a DW, it is
unclear whether a numeric data field extracted
from a production data source is a fact or an
attribute.
Simply ask yourself the question:
Is the numeric data field a measurement that
varies every time we sample it?
Or is it a discretely valued description
of something that is more or less constant?
51
Data Warehouse System
Data Connection(s) Layer
ETL
Query Tools
Analysis Tools
Presentation Interface
Quality Assurance procedures
*Politics*
52
Basic Processes - Data Warehouse
Extracting - the first step of getting data into the data
warehouse.
Transformation - once data is extracted into the data
staging area, there are many possible transformation steps,
including cleaning the data, correcting misspellings,
purging selected fields, creating surrogate keys for each
dimension, building aggregates, etc.
Loading and Indexing - loading the data into the data
warehouse and indexing it.
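A toy end-to-end sketch of these three steps in Python (the file name, layout, and cleaning rules are invented for illustration):

import csv, sqlite3

def extract(path):
    # Extract: read raw rows from a source extract file (hypothetical layout).
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: clean and correct values in the staging area.
    for row in rows:
        row["state"] = row["state"].strip().upper()   # fix casing/whitespace
        row["units"] = int(row["units"] or 0)         # default missing measures
        if row["units"] >= 0:                         # purge bad records
            yield row

def load(conn, rows):
    # Load: insert the cleaned rows into the warehouse staging table.
    conn.executemany(
        "INSERT INTO stage_sales(state, product, units) VALUES (?, ?, ?)",
        [(r["state"], r["product"], r["units"]) for r in rows])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stage_sales(state TEXT, product TEXT, units INTEGER)")
load(conn, transform(extract("sales_extract.csv")))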
53
Consolidation of Disparate Data Sources
Excel spreadsheets
Access databases
A plethora of other RDBMSs
Most of your work will be in the ETL / data
staging area. This will make or break your
project!
54
Basic Processes - Data Warehouse
Quality Assurance Checking - quality can
be checked by running a comprehensive exception report
over the entire set of newly loaded data.
Release/Publishing - the user community must be
notified that the new data is ready.
Updating - modern data marts may well be updated,
sometimes frequently: changes in labels, changes in
hierarchies, changes in status, and changes in corporate
ownership.
55
Basic Processes - Data Warehouse
Querying - a broad term that encompasses
all the activities of requesting data from a data mart.
Data Feedback / Feeding in Reverse - data can
also flow in the opposite direction, uphill from the
traditional flow we have discussed.
Auditing - at times it is critically important to know
where the data came from and what calculations were
performed. For this you can create special audit records.
56
Basic Processes - Data Warehouse
Securing - every data warehouse faces an exquisite
dilemma: publishing the data as widely, to as many users
as possible, with the easiest of user interfaces, while at the
same time protecting the data from misuse and snoopers.
Backing Up and Recovering - since data warehouse
data is a flow from the legacy systems through
to the data marts and eventually onto the users' desktops,
a real question arises about where to take the necessary
snapshots.
57
Core Pieces
Select Reporting Tool
Must be simple yet robust for Clients
Performance, server/client workload
Security, server/client layers
Select ETL method
Use what you know best
Ease of maintenance
58
Steps in the Design Process
It is good to approach the design of a data warehouse in
a consistent way.
You can achieve this by following four steps in a
particular order.
Remember, the perspective necessary to actually make
these decisions comes from an understanding of the end
user requirements and of what is in the legacy data sources
that are available to the data warehouse:
Choose a business process to model
Choose the grain of the business process
Choose the dimensions and their attributes
Choose the measured facts
59
Database Design Methodology for Data Warehouses
The Nine-Step Methodology includes the
following steps:
Choosing the process
Choosing the grain
Identifying and conforming the dimensions
Choosing the facts
Storing pre-calculations in the fact table
Rounding out the dimension tables
Choosing the duration of the database
Tracking slowly changing dimensions
Deciding the query priorities and the query modes.
60
Step 1: Choosing The Process
The process (function) refers to the
subject matter of a particular data mart.
First data mart built should be the one
that is most likely to be delivered on time,
within budget, and to answer the most
commercially important business
questions.
61
ER Model of an Extended Version of DreamHome
62
ER Model of Property Sales Business Process of DreamHome
63
Step 2: Choosing The Grain
Decide what a record of the fact table is to
represent.
Identify dimensions of the fact table. The grain
decision for the fact table also determines the grain
of each dimension table.
Also include time as a core dimension; it is
always present in star schemas.
64
Grain
The level of detail at which measures are recorded.
The grain provides meaning to a number stored in the fact
table:
Fact = revenue
Dimensions = day, sales person, product
Grain = revenue per day per sales person per
product
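A small sketch of what the grain means mechanically: the fact table holds exactly one row per combination of the grain's dimension keys (the sample data is invented):

from collections import defaultdict

# Hypothetical transaction feed; the declared grain is
# (day, sales person, product), so we aggregate to exactly that level.
transactions = [
    ("2003-04-01", "Lee",  "Salt Bread", 40.0),
    ("2003-04-01", "Lee",  "Salt Bread", 25.0),  # same grain -> same fact row
    ("2003-04-01", "Anna", "Muffins",    60.0),
]

fact_rows = defaultdict(float)
for day, sales_person, product, revenue in transactions:
    fact_rows[(day, sales_person, product)] += revenue

for grain_key, revenue in fact_rows.items():
    print(grain_key, revenue)   # one row per day x sales person x product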
65
Step 3: Identifying and Conforming the Dimensions
Dimensions set the context for asking
questions about the facts in the fact table.
If any dimension occurs in two data
marts, they must be exactly the same
dimension, or one must be a
mathematical subset of the other.
A dimension used in more than one data
mart is referred to as being conformed.
66
Star Schemas for Property Sales and Property Advertising
67
Step 4: Choosing The Facts
The grain of the fact table determines
which facts can be used in the data mart.
Facts should be numeric and additive.
Unusable facts include:
non-numeric facts,
non-additive facts,
facts at a different granularity from other facts in the table.
68
Property Rentals with a Badly Structured Fact Table
69
Property Rentals with Fact Table Corrected
70
Step 5: Storing Pre-Calculations in the Fact Table
Once the facts have been selected, each
should be re-examined to determine whether
there are opportunities to use
pre-calculations.
71
Step 6: Rounding Out The Dimension Tables
Text descriptions are added to the
dimension tables.
Text descriptions should be as intuitive and
understandable to the users as possible.
Usefulness of a data mart is determined by
the scope and nature of the attributes of the
dimension tables.
72
Step 7: Choosing The Duration Of The Database
Duration measures how far back in time the
fact table goes.
Very large fact tables raise at least two very
significant data warehouse design issues.
It is often difficult to source increasingly old data.
It is mandatory that the old versions of the important
dimensions be used, not the most current versions.
This is known as the Slowly Changing Dimension problem.
73
Step 8: Tracking Slowly Changing Dimensions
Slowly changing dimension problem
means that the proper description of the
old dimension data must be used with old
fact data.
Often, a generalized key must be assigned
to important dimensions in order to
distinguish multiple snapshots of
dimensions over a period of time.
74
Step 8: Tracking Slowly Changing Dimensions
Three basic types of slowly changing
dimensions:
Type 1, where a changed dimension attribute is
overwritten.
Type 2, where a changed dimension attribute causes a
new dimension record to be created.
Type 3, where a changed dimension attribute causes
an alternate attribute to be created so that both the old
and new values of the attribute are simultaneously
accessible in the same dimension record.
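A compact sketch of the three treatments applied to one changed attribute (the row structure and field names are illustrative assumptions):

from datetime import date

# A dimension row whose "city" attribute is about to change.
row = {"surrogate_key": 101, "customer_id": "C42", "city": "Delhi",
       "prior_city": None, "effective_date": date(2001, 1, 1),
       "expiration_date": None, "most_recent": True}

def scd_type1(row, new_city):
    # Type 1: overwrite the attribute in place; history is lost.
    row["city"] = new_city

def scd_type2(dim_rows, row, new_city, new_key, load_date):
    # Type 2: expire the old row and insert a new row under a new surrogate key.
    row["expiration_date"], row["most_recent"] = load_date, False
    dim_rows.append({**row, "surrogate_key": new_key, "city": new_city,
                     "effective_date": load_date, "expiration_date": None,
                     "most_recent": True})

def scd_type3(row, new_city):
    # Type 3: keep the old value in an alternate attribute beside the new one.
    row["prior_city"], row["city"] = row["city"], new_city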
75
Step 9: Deciding The Query Priorities And The Query Modes
The most critical physical design issues
affecting the end-users' perception
include:
physical sort order of the fact table on disk;
presence of pre-stored summaries or aggregations.
Additional physical design issues include
administration, backup, indexing
performance, and security.
76
Database Design Methodology for Data Warehouses
Methodology designs a data mart that
supports requirements of particular
business process and allows the easy
integration with other related data marts to
form the enterprise-wide data warehouse.
A dimensional model, which contains more
than one fact table sharing one or more
conformed dimension tables, is referred to
as a fact constellation.
77
Fact and Dimension Tables for each Business Process of
DreamHome
78
Dimensional Model (Fact Constellation) for the DreamHome Data
Warehouse
79
When I wish upon a Star
80
Are You Familiar
The Goals of a Data Warehouse
The Chess Pieces
Different worlds OLTP/ Data warehouse
Dimensional Model Basic
Hierarchies in Dimensions
The Fact Table
The Star Schema
The Snowflake Schema
Basic Processes of a Data warehouse
81
What Is ETL?
Extract - the process of reading data from an outer (source) database.
Transform - the process of converting extracted data to a
form useable by the target database.
Occurs by using rules or lookup tables, or by combining the data with other
data.
Load - the process of writing the data into the target database.
82
What does ETL do?
Extracts data from multiple data sources
Migrates data from one DB to another
Converts a DB from one format or type to another
Transforms the data to make it accessible to business
analysis
Forms data marts and data warehouses
Enables loading of multiple target databases
Performs at least three specific functions:
reads data from an input source;
passes the stream of information through either an engine-based
or code-based ETL process to modify, enhance, or eliminate data
elements based on the instructions of the job;
writes the resultant data set back out to a flat file, relational table,
etc.
83
What can ETL be used for?
To acquire a temporary subset of data (like a
VIEW) for reports or other purposes.
A more permanent data set may be acquired for
other purposes, such as the population of a
data mart or data warehouse.
84
ETL SYSTEM
[Diagram: operational data and outer sources (different vendors, different
formats) flow into the ETL engine (Extract, Transform, Load, Filter), which
populates the Data Warehouse; local data marts are fed from the warehouse
and serve the OLAP end users. Data extracted from the data warehouse into
the marts provides faster processing.]
85
Technical architecture design
Design of the technical environment
to enable the logical design
It is a description of the elements
and services of the BI environment
A map of how the components will
fit together and communicate
Basically... a blueprint by which the
team, consultants, and vendors will
build the Business Intelligence
Environment
86
The architecture conceptual model
[Diagram: source systems - PA/PM (Siemens/SMS, DEC Alpha Unix, Sybase),
Critical Paths (Landacorp, Oracle, HP-Unix), Budgets (custom mainframe, SAS),
Cost Reports and Contract monitoring (MS Windows, MS Excel) - feed the
Acquisition Services: Data Staging Services (extraction, transformation, load,
cleansing) plus Data Staging Administration (job/process control, job/process
monitoring, metadata exchange, data modeling) produce load files into the
Data Staging Area.
Organization Services surround the Data Warehouse: Metadata Services
(source/target models, business definitions, audit statistics, performance
statistics, ETL statistics) with metadata exchange into a Metadata Repository,
and Data Services (bulk data loader, aggregation management, index
management, audit statistics, DBA administration, security administration).
Consumption Services deliver the data out: Data Marts (OLAP MDB for Program
Evaluation; RDBMS for Performance Based Budgeting), Data Access Services
(report library management, report distribution, report scheduling, OLAP cube
refreshing, query management, aggregation management, security verification,
metadata navigation), Data Warehouse Administration (data modeling, data
access tool mgmt., data base administration, data staging administration), and
Planned Services (web reporting, web OLAP, data mining).]
87
Data acquisition services
[Diagram: source systems - PA/PM (Siemens/SMS, DEC Alpha Unix, Sybase),
COSTS (Eclipsys/TSI, Compaq HP-UX, Oracle), BUDGETS (custom mainframe,
SAS), PATHWAYS (Landacorp, IBM AIX, Oracle) - feed the Data Staging Services
(extraction, transformation, load, cleansing) and Data Staging Administration
(job/process control, job/process monitoring, metadata exchange, data
modeling), producing load files in the Data Staging Area and exchanging
metadata with the Organization Services.]
88
Acquiring the data
[Diagram: internal & external data sources - PM/PA, EMR, AP/MM, Home,
Solucient, State, MR, CDR, GL/HR, etc.]
Obstacles to Integration:
Different data models
Different data definitions
Different data base systems
Different computer platforms
Dirty data
Number of operational sources
Approaches to Acquisition:
1. Hand code the extraction, transformation, cleansing, and loading
services using the data manipulation language of choice (e.g.,
SAS, COBOL, MS DTS, Perl) - the most common approach,
especially for proprietary DSS data models.
2. Buy acquisition services from an ETL software vendor and
customize them to your environment.
89
ETL attributes = $$$$
Multi-threaded engines (e.g., Informatica, Cognos) or
code generation (e.g., ETI, SAS, DataStage)
Number of Source/Target DBMSs supported
Number of computing platforms supported
(1-tier, 2-tier, N-tier)
Change data capture
Breadth of transformation techniques
Metadata driven
What metadata standard?
Multiple data loading options (incremental, bulk,
table management, partitioning)
90
ETL technology - horizontal marketplace
[Chart: the horizontal ETL vendor marketplace, including Carleton and
Informatica]
91
" The large HIS vendors will adopt generic ETL technology and customize the
functionality to their application portfolio and data bases.
" Horizontal ETL vendors MAY develop health care vendor portfoliossuch as they
do for ERP vendorsbut that will depend on demandand if they survive.
" DBMS providers will increasingly provide powerful ETL solutions making any
third-party tool obsolete, assuming you have a homogenous DBMS
implementation.
" Addressing data quality will be the hardest process and tool set to sell to
healthcare organizations.
" Transitioning from hard-coded interfaces to a metadata driven data acquisition
environment will follow the typical healthcare technology adoption cycle, that is, a
long time.
ETL technology predictions
92
Organization services
[Diagram: around the Data Warehouse sit Metadata Services (source/target
models, business definitions, audit statistics, performance statistics, ETL
statistics) backed by a Metadata Repository, and Data Services (bulk data
loader, aggregation mgmt, index management, audit statistics, DBA
administration, security admin), connected to the Acquisition Services and
their load files.]
93
Data modeling tools
[Diagram: data modeling tools (ERwin, Embarcadero, or DSS-proprietary data
models) exchange metadata with the Data Staging Services (extraction,
transformation, load, cleansing) and their load files.]
Source and target data models are the center
of a metadata-driven environment.
94
Issues that are key to an effective ETL tool
Scheduling and job dependencies: particularly relies
on the graphical environment.
Session nesting: when developing an ETL session for
a particular part of the system, nesting eliminates
duplicate development.
Robust SQL support: increases speed over using code
to read and write to a database.
Version management: enables quick roll back rather
than manually making code changes. In many cases,
the DB's version control may not work on the ETL.
95
Key Issues (Contd.)
Debugging functionality: very useful for
developer support.
ETL should rely on the underlying database
security.
Transformation capabilities vs. cleansing
capabilities: tools are seldom very strong in both.
Metadata support: must work with the overall
metadata strategy.
96
Current ETL Market Share
Total Market Share: $667 Million
97
ETL Evaluation
Throughout the following sections, each of the vendors and their ETL
products are evaluated, focusing on primary differences between such
products.
Ascential Software
Formed in July 2001
Focuses on improving, developing, and perfecting its ETL and
back-end tools
Does not have current plans to enter the BI tool market
The Ascential DataStage product family:
a highly scalable ETL solution
uses end-to-end metadata management and data quality assurance
functions
can create and manage scalable, complex data integration for enterprise
applications such as CRM, ERP, SCM, BI/analytics, E-business and data
warehouses
98
99
Cognos Corporation
Founded in 1969
Prefers that all components of the enterprise data
warehouse are Cognos products
DecisionStream easily integrates with Cognos BI tools, etc.;
it has difficulty integrating with other vendors' products
DecisionStream is powerful ETL software
Allows users to extract and unite data from disparate sources
and deliver coordinated Business Intelligence across your
organization
Includes advanced data merging, aggregation and
transformation capabilities: lets users unite data from different
sources, and transform it into information using best-practices
dimensional design
100
101
Informatica PowerConnect
An extension to Informatica PowerCenter and PowerCenterRT data integration
software.
Eliminates the need for customers to manually code data extraction programs for
their enterprise applications.
Ensures that mission-critical operational data can be effectively used to inform key
business decisions across the enterprise.
Allows companies to directly source and integrate:
ERP
CRM
Real-time message queue
Mainframe
AS/400
Remote data
Metadata
with other enterprise data and deliver it to:
Data warehouses
Operational data stores
Business intelligence tools
Packaged analytic applications.
102
103
Conclusion
Issues analyzed:
development environments
version control
security
metadata exchange standards
cost
The ETL tools presented by Ascential and Informatica are
comparable in numerous ways;
it would be best to select Informatica as an ETL vendor,
as it is more mature and stable as a company.
104
The Staging Area
How to Stock Your Data Warehouse
Pantry
Christopher Richard
[Data Warehousing System Architect]
105
All-You-Can-Eat Buffet
Buffet (ODS, DW, DM)
Recipe (Business/ transformation rules)
Kitchen (ETL)
Ingredients from different suppliers (Source
systems)
Pantry (Staging Area)
Our topic is the pantry, the Staging Area,
because it is the foundation (and the stepchild) of Data
Warehousing.
106
Why have a pantry?
Minimizing processing on source systems
Extract only once
Data integrity
Source data within own control
Incrementals
Freedom of storage format and abstraction
Audit trail
Persistence of data
Timing flexibility
Processing power
Consistent interface for downstream processes
107
Minimizing processing on source systems
Extract only once
Staging Area serves downstream systems, thus limiting
impact to the source system
Consistent extract methodology
Central knowledge base of source system extraction
expertise
Data Integrity
Proper timing of different extracts within source system
schedules
Both table-centric and document-centric extraction can be
applied as necessary
108
Table-centric vs. Document-centric Extraction
[Diagram: a source order (Order Number 1000, Order Date 2/1/2001, Order
Amount 100.00) with two lines (Line 1: Product A, Qty 10; Line 2: Product B,
Qty 20) staged two ways. In table-centric extraction, the order header and the
order lines are staged as separate tables, each row carrying its own Restart ID.
In document-centric extraction, the header and its lines are staged together as
one document under a shared Restart ID.]
109
Incremental Source Extraction
Reliable Change Identifier
Ever increasing number
Timestamp
Correlated Change Identifier
Change Log
Don't forget about deletes
Hard deletes
Soft deletes
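Hard deletes leave no trace in an ordinary extract, so one common pattern (sketched here with invented key sets) is to compare staged keys against a full source key extract; soft deletes, by contrast, arrive as normal updates to a status flag:

# Minimal sketch: detecting hard deletes by comparing key sets.
staged_keys = {"1000", "1001", "1002"}   # keys already in the Staging Area
source_keys = {"1000", "1002"}           # full key extract; 1001 was hard-deleted

for key in sorted(staged_keys - source_keys):
    print(f"emit DML code 'D' (delete) for source key {key}")  # propagate downstream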
110
Incrementals Implementation
Cyclic Redundancy Checksum
Calculated for each extracted increment
True delta identification; should precede all other items
Data Manipulation Language Code [Insert, Update, Delete]
Propagatable after reassessment
Column Change Bitmap
Easy identification for downstream systems (Type 2 SCD)
Restart Identifier [Bookmark]
An ever-increasing number, unique in the whole Staging Area
Used to quickly identify the records not yet processed by downstream systems
Source Key Identifier [1:1 with source key]
An ever-increasing number, unique for a particular source key in the whole Staging
Area
Multiple per source key allowed, to support source key re-use
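A sketch of how a column change bitmap might be computed, one bit per tracked column (the column list and rows are illustrative):

def column_change_bitmap(old_row, new_row, columns):
    # '1' where the column value changed; downstream systems (e.g., Type 2
    # SCD logic) test only the bits they care about.
    return "".join("1" if old_row[c] != new_row[c] else "0" for c in columns)

columns = ["product", "product_type", "color"]
old = {"product": "A", "product_type": "Shoe", "color": "Blue"}
new = {"product": "A", "product_type": "Shoe", "color": "Red"}
print(column_change_bitmap(old, new, columns))   # -> "001": only color changed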
111
Column Change Bitmap Example
[Diagram: source rows for Product A (Color changing from Blue to Red; Price
changing from 50.00 effective 2/1/2001 to 55.00 effective 5/1/2001) land in the
Staging Area tables with a Change Bitmap and Restart ID on each row (e.g.,
bitmap 001 / Restart ID 24 where only Color changed; bitmap 011 / Restart ID 49
where Price and Effective Date changed), and the combined row reaches the
data mart table with its own bitmap (0011) and Restart ID.]
112
Audit Trail
Track data lineage
Track data movement across tables and systems
Try to tag the data as soon as it enters the stream
Track data changes
Track data changes within a table
Automate data change tracking outside of coding discipline wherever
possible
113
Audit Trail - Implementation
Propagation of the identifiers to downstream processes
Restart Identifier
Source Key Identifier
Source System Identifier
Table specific audit data
Job Run Identifier
Source extract date & time
Create and change date & time and user
Column Change Bitmap
114
Key learnings from doing
True delta determination is essential for large data
volumes and Type II/III Slowly Changing Dimensions
You will have to compromise functionality for
performance
You will have to compromise data completeness for
performance
Allow staging tables to differ in design from the source
tables
Cookie cutters do work
115
Key learnings from doing
Use one sequencer for all surrogate keys
Implement complete pieces of logic as early in the process stream
as possible, so downstream processes can benefit from it in the
most timely manner
Set processing may lead to seeking alternative storage options
Use a sounding board
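The "one sequencer" idea in miniature: a single monotonically increasing counter hands out surrogate keys for every dimension, so a key value is unique across the whole staging area (a single-process sketch; a real loader would persist the counter):

import itertools

# One global sequencer for all surrogate keys in the Staging Area.
surrogate_key_sequencer = itertools.count(start=1)

customer_key = next(surrogate_key_sequencer)   # 1
product_key  = next(surrogate_key_sequencer)   # 2
date_key     = next(surrogate_key_sequencer)   # 3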
116
Data Staging
The Data Staging Process is the iceberg of the data
warehouse project.
While an iceberg looks formidable from the ship's helm,
we often don't gain a full appreciation of its magnitude
until we collide with it.
So many challenges are buried in the data sources and the
systems they run on that this part of the process
invariably takes much more time than you expect.
The concepts and approach in this training apply to both
hand-coded staging systems and data staging tools
117
Data Staging
Takes data from the operational systems and
prepares it for dimensional model in the data
presentation area.
It is a backroom service and not a query service.
Unfortunately many teams focus on the E and L
of ETL
The E does have its challenges.
But most of the heavy lifting occurs in the 'T'
118
Transformation
Combine data
Deal with quality issues
Identify updated data
Manage surrogate keys
Build aggregates
Handle errors
119
Getting Started
For once I will skip our primary mantra of focus on the
business requirements and present our second-favorite
aphorism
MAKE A PLAN
Do we need to use a tool?
You need to decide early
Do not expect to recoup your investment on the first
iteration due to the learning curve.
A tool would provide greater metadata integration and
enhanced flexibility, reusability, and maintainability in the
long run.
120
Dimensional Data Staging
Extract Dimensional Data from Operational
Systems
Cleanse attribute values
Name and address parsing
Inconsistent descriptive values
Missing decodes
Overloaded codes with multiple meaning over time
Invalid data
Missing data
121
Dimensional Data Staging
Manage surrogate key assignments
Since we maintain surrogate keys in the warehouse we
must maintain a persistent master cross-reference table
in the staging area for each dimension
The cross reference table keeps track of the surrogate
key assigned to an operational key at a point in time
along with the attribute profile.
We interrogate the extracted dimensional source data
to determine whether it is a new dimension row, an
update to an existing row, or neither.
New records are identified easily because the
operational source key is not maintained in the master
cross reference table
122
Master Dimension Cross-Reference Table
[Table layout, one row per surrogate key assignment:]
Surrogate Dimension Key
Operational Source Key
Dimension Attributes 1-N
Dimension row effective date
Dimension row expiration date
Most recent Dimension Row Indicator
Most recent Cyclic Redundancy Checksum (CRC)
123
Dimensional Data Staging
To quickly determine if rows have changed, we rely on a
cyclic redundancy checksum (CRC) algorithm.
If the CRC is identical for the extracted record and the
most recent row of the master cross-reference table, then
we ignore the extracted record.
If the CRC differs, then we need to study each column to
determine what's changed and then how the change will
be handled (Type 1 / Type 2 / Type 3).
The final step is to update the most recent surrogate key
assignment table.
This table consists of OS keys and their most recently
assigned surrogate keys, acting as a fast lookup.
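A sketch of the CRC comparison, using zlib.crc32 as the checksum and a plain dict standing in for the master cross-reference table (both are simplifying assumptions):

import zlib

def row_crc(row, attributes):
    # Checksum over the concatenated attribute values; any change flips it.
    payload = "|".join(str(row[a]) for a in attributes).encode("utf-8")
    return zlib.crc32(payload)

attributes = ["name", "city", "status"]
cross_ref = {"C42": {"crc": 123456789, "surrogate_key": 101}}  # stored CRC from the prior load (illustrative value)

extracted = {"source_key": "C42", "name": "Acme", "city": "Mumbai", "status": "active"}
entry = cross_ref.get(extracted["source_key"])

if entry is None:
    print("new dimension row: assign a surrogate key")
elif entry["crc"] == row_crc(extracted, attributes):
    print("unchanged: ignore the extracted record")
else:
    print("changed: inspect columns, apply Type 1/2/3 handling")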
124
Dimension Table Surrogate Key Management
[Flow diagram: the source extract is CRC-compared against the master
dimension cross-reference. No CRC change: ignore the record. New source
rows: assign surrogate keys, set dates/indicator, insert into the cross-reference
and into the dimension. Changed rows: Type 1 or 3 changes update the
dimension (and cross-reference) in place; Type 2 changes update the prior
most-recent row, assign a new surrogate key with dates/indicator, and insert a
new row. Finally, the most-recent key assignment table is updated.]
125
Dimension Data Staging
Build dimension row load images and publish
revised data.
Once the dimension table reflects the most recent
extract (and has been confidently quality assured),
it is published to all data marts that use the
dimension.
126
Fact Table Staging
Extract fact data from operational sources
Receive updated dimensions from the dimension
authorities
Separate the fact data by granularity as required
Transform the fact table as required
Replace the operational source keys with surrogate
keys
We use the most recent surrogate key assignment table
created by the dimension authority to do this.
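A sketch of the key substitution step, with the most recent key assignment table modeled as a dict from operational key to surrogate key (all names invented):

# Most recent surrogate key assignments, as published by the dimension authority.
customer_keys = {"C42": 101, "C77": 102}
product_keys  = {"P9": 501}

fact_rows = [{"customer_id": "C42", "product_id": "P9", "units": 10}]

for row in fact_rows:
    # Replace operational source keys with warehouse surrogate keys.
    row["customer_key"] = customer_keys[row.pop("customer_id")]
    row["product_key"]  = product_keys[row.pop("product_id")]

print(fact_rows)   # rows now carry surrogate keys, ready for bulk load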
127
Fact Table Staging
Add additional keys for known context.
Quality assure the fact table data
Construct or update aggregation fact tables
Bulk load the data
Alert the users
128
129
Smarter Business Intelligence
Outsmarting to be #1
Informatica Corporation
April 23, 2003
130
Business Imperatives
Changing markets forcing products to evolve or innovate
Changing competitive landscape forcing strategies to change
Changing economies forces organizations to contract and be effective
Changing financial drivers geared towards profitability
Changing market positioning to leadership to be NUMBER 1!
Forces all companies to think smarter than ever!
131
Business Imperatives
Smarter.
Marketing campaigns
Products and positioning
Go-to-market strategies
Financial investments
Lead-to-sales generation cycle
People!
132
Business Imperatives
The Challenge:
Making people think smarter
Expensive!
Impossible!
Not worth the effort!
133
Business Imperatives
The Solution:
Business Intelligence Initiatives
Enterprise Data Warehouse Project
Balance Scorecard Systems
EIS (Executive Information System) Project
Management Cockpit Infrastructure
Business Analytics Platform
134
Business Analytics Solutions Often Include
Multiple Tools And Technologies
Data Integration - extract, transform and load
data into the warehouse
Data Warehouse - organize and store
transaction information
Business Intelligence - provide end-users with
reports and ad hoc access to
the data in the warehouse
135
Informatica Business Analytics Suite
Modular Plug-&-Play Approach Offers Best of Buy and Build
136
Market Leaders Rely on Informatica
80%+ of the Fortune 100
80%+ of the Dow Jones Industrial Average
Global Reach
Entertainment - The 5 Largest
Telecommunications - 13 of the Top 14
Financial Services - 12 of the Top 15
Pharmaceutical - 12 of the Top 13
Utilities - 15 of the Top 20
Insurance - 16 of the Top 21
Manufacturing - 12 of the Top 16
All 4 branches of the US Armed Forces
137
Boosting productivity
"By visually defining mappings and transformations through an easy-to-use
GUI, we have been able to significantly reduce data warehouse
maintenance and support costs. In fact, we now only have one resource
managing a half-terabyte data warehouse."
Grady Boggs
Data Warehouse Manager
"At Hewlett-Packard, we are always looking for innovative ways to leverage
technology to improve productivity, and using Informatica, we have seen an
over 75 percent improvement in development productivity and time to
market."
Rudy Garza
Data Architect
"We have achieved very rapid time-to-deployment with Informatica, and the
resulting increase in our operational and analytic capabilities will drive
increased value and savings for Deluxe. Through automated replication
processes and streamlined workflow, we anticipate a $6 million annual
reduction in data-maintenance costs."
Andy Field
Senior Director
138
Thrifty improves productivity by over 75%
Challenge:
Systems difficult to maintain through lack of updated and accurate
records of how, why, and where data was transferred
Heavy reliance on code resulted in limited transformation capabilities and
flexibility to deal with changes in business requirements
Developing a metadata strategy promoting reuse proved to be difficult
Solution:
Single console for design, development, testing, daily management, scheduling, and
smart recovery after failed components
Simple operation, and evolution
Object-oriented, user friendly interface with over 100 built in transformations and
robust visual debugger
Use of wizards to visually go through error-prone and repetitive tasks
Results:
Integrated product suite enables rapid development and time to market
Active and automated metadata solution, promoting reuse
ROI in under a year
139
Delivering on the Performance Promise
"One of the main drivers behind the success of our very high performance,
highly scalable enterprise data warehouse has been the performance and
scalability of PowerCenter. PowerCenter's performance gives us the
confidence to scale our data warehouse into the 10-20 Terabyte range in
the years ahead."
Mark Cothron
Data Warehouse Architect
"Informatica's performance capabilities and scalability immediately lifted it
over the competition. Using Informatica we have created a multi-terabyte
data warehouse, and the analysis and action-enabling information this
system provides has given us a competitive advantage that can't be
matched."
Patrick Firouzian
Director
140
PepsiCo creates 3 data warehouses in
excess of 1 TB
"Informatica's performance has been superb and we have seen drastic
improvements with each new release. We are always looking to get information
into the hands of our business users quicker and more efficiently, and using
Informatica we have over 30 data integration projects, with the largest being a 7
Terabyte data warehouse."
Wendy Faegre
Systems Manager
Results:
Largest data warehouse > 7 TB, easily loaded in a 3-hour batch window
Processes over 60 GB daily and 800 GB monthly
Throughput exceeds 30 GB/hour
70% improvement in performance over code
141
Informatica Overview
Corporate
Founded (1993); Nasdaq: INFA (1999)
Over 800 employees worldwide
Financials
2000: $154 million revenue
2001: $197 million revenue
2002: $195 million revenue
Partners
Over 200 sales, marketing and implementation partners
Including: i2, PeopleSoft, Big 5, Siebel, SAP, Mitsubishi
Products
Industry-leading solutions for deploying business
analytics across the enterprise:
Data Integration - Data Warehouses
Business Intelligence - Analytic Applications
Customers
Over 1700 worldwide
80 of the Fortune 100 and 80% of the Dow Jones